Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Research Indigenous are descendants of the earliest split from ancient Eurasian populations

Juan L. Rodriguez-Flores,1,8 Khalid Fakhro,2,3,8 Francisco Agosto-Perez,1,4 Monica D. Ramstetter,4 Leonardo Arbiza,4 Thomas L. Vincent,1 Amal Robay,3 Joel A. Malek,3 Karsten Suhre,5 Lotfi Chouchane,3 Ramin Badii,6 Ajayeb Al-Nabet Al-Marri,6 Charbel Abi Khalil,3 Mahmoud Zirie,7 Amin Jayyousi,7 Jacqueline Salit,1 Alon Keinan,4 Andrew G. Clark,4 Ronald G. Crystal,1,9 and Jason G. Mezey1,4,9 1Department of Genetic Medicine, Weill Cornell Medical College, New York, New York 10065, USA; 2Sidra Medical and Research Center, Doha, Qatar; 3Department of Genetic Medicine, Weill Cornell Medical College–Qatar, Doha, Qatar; 4Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14850, USA; 5Bioinformatics Core, Weill Cornell Medical College–Qatar, Doha, Qatar; 6Laboratory Medicine and Pathology, Hamad Medical Corporation, Doha, Qatar; 7Department of Medicine, Hamad Medical Corporation, Doha, Qatar

An open question in the history of migration is the identity of the earliest Eurasian populations that have left con- temporary descendants. The was the initial site of the out-of- migrations that occurred between 125,000 and 60,000 yr ago, leading to the hypothesis that the first Eurasian populations were established on the Peninsula and that contemporary indigenous Arabs are direct descendants of these ancient peoples. To assess this hypoth- esis, we sequenced the entire genomes of 104 unrelated natives of the Arabian Peninsula at high coverage, including 56 of indigenous Arab ancestry. The indigenous Arab genomes defined a cluster distinct from other ancestral groups, and these genomes showed clear hallmarks of an ancient out-of-Africa bottleneck. Similar to other Middle Eastern populations, the indigenous Arabs had higher levels of admixture compared to Africans but had lower levels than Europeans and Asians. These levels of Neanderthal admixture are consistent with an early divergence of Arab ancestors after the out- of-Africa bottleneck but before the major Neanderthal admixture events in Europe and other regions of Eurasia. When compared to worldwide populations sampled in the 1000 Genomes Project, although the indigenous Arabs had a signal of admixture with Europeans, they clustered in a basal, outgroup position to all 1000 Genomes non-Africans when consid- ering pairwise similarity across the entire genome. These results place indigenous Arabs as the most distant relatives of all other contemporary non-Africans and identify these people as direct descendants of the first Eurasian populations estab- lished by the out-of-Africa migrations. [Supplemental material is available for this article.]

All can trace their ancestry back to Africa (Cann et al. The relationship between contemporary Arab populations 1987), where the ancestors of anatomically modern humans first and these ancient human migrations is an open question diverged from primates (Patterson et al. 2006), and then from ar- (Lazaridis et al. 2014; Shriner et al. 2014). Given that the Arabian chaic humans (Prüfer et al. 2014). Humans began leaving Africa Peninsula was an initial site of egress from Africa, one hypothesis through a number of coastal routes, where estimates suggest these is that the original out-of-Africa migrations established ancient “out-of-Africa” migrations reached the Arabian Peninsula as early populations on the peninsula that were direct ancestors of con- as 125,000 yr ago (Armitage et al. 2011) and as late as 60,000 yr temporary Arab populations (Lazaridis et al. 2014). These people ago (Henn et al. 2012). After entering the Arabian Peninsula, would therefore be direct descendants of the earliest split in the human ancestors entered and spread to Australia lineages that established Eurasian and other contemporary non- (Rasmussen et al. 2011), Europe, and eventually, the Americas. African populations (Armitage et al. 2011; Rasmussen et al. 2011; The individuals in these migrations were the most direct ancestors Henn et al. 2012; Lazaridis et al. 2014; Shriner et al. 2014). If this of ancient non-African peoples, and they established the contem- hypothesis is correct, we would expect that there are contempo- porary non-African populations recognized today (Cavalli-Sforza rary, indigenous Arabs who are the most distant relatives of other and Feldman 2003). Eurasians. To assess this hypothesis, we carried out deep-coverage genome sequencing of 104 unrelated natives of the Arabian Peninsula who are citizens of the nation of Qatar (Supplemental Fig. 1), including 56 of indigenous ancestry who are the 8These authors contributed equally to this study. best representatives of autochthonous Arabs, and compared these 9These authors contributed equally as senior investigators for this genomes to contemporary genomes of Africa, Asia, Europe, and study. Corresponding author: [email protected] Article published online before print. Article, supplemental material, and publi- © 2016 Rodriguez-Flores et al. This article, published in Genome Research,is cation date are at http://www.genome.org/cgi/doi/10.1101/gr.191478.115. available under a Creative Commons License (Attribution 4.0 International), Freely available online through the Genome Research Open Access option. as described at http://creativecommons.org/licenses/by/4.0/.

26:151–162 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/16; www.genome.org Genome Research 151 www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Rodriguez-Flores et al. the Americas (The 1000 Genomes Project Consortium 2012; to African populations. A plot of just the Middle Eastern popula- Lazaridis et al. 2014). tions on the principal components also showed clustering as expected, with the Q1 (Bedouin) clustering with previously sam- pled and Arabs, Q2 (Persian-South Asians) with Iranians, Results and Q3 (African) outside of the Middle Eastern cluster (data not shown) (Fig. 1B). Population structure of the Arabian Peninsula Previous analyses of the populations of the Arabian Peninsula Y Chromosome and mitochondrial DNA haplogroups (Hunter-Zinck et al. 2010; Alsmadi et al. 2013) have found three We next analyzed the Y Chromosome (Chr Y) and mitochondrial distinct clusters that reflect primary ancestry: Q1 (Bedouin); Q2 DNA (mtDNA) to assess the degree to which the Q1 (Bedouin), Q2 (Persian-South Asian); and Q3 (African) (Omberg et al. 2012). By (Persian-South Asian), or Q3 (African) Qatari ancestry groups rep- assessment of medical records and ancestry-informative SNP gen- resent distinct subpopulations (Fig. 2). The Chr Y haplogroups otyping (Supplemental Fig. 2), a sample of 108 purportedly unre- showed almost no overlap between the Q1 (Bedouin) Qataris lated individuals was selected for sequencing, including 60 Q1 and Q2 (Persian-South Asian) Qataris, in which an Analysis of (Bedouin), 20 Q2 (Persian-South Asian), and 20 Q3 (African), as Molecular Variance (AMOVA) was highly significant (P < 0.018) well as 8 Q0 (Subpopulation Unassigned) that could not be cleanly (Supplemental Table VII). The Arab haplogroup J1 was the domi- placed in one of these three groups (Supplemental Table I). Each nant haplogroup in the Q1 (Bedouin) Qataris, but this haplogroup of these genomes was sequenced to a median depth of 37× (mini- was not represented at all among the Q2 (Persian-South Asian) mum 30×) by Illumina technology, identifying a total of Qataris (Fig. 2A). This confirmed that these are genetically well- 23,784,210 SNPs (see Methods, Supplemental Table II). defined subpopulations that are relatively isolated from one an- To confirm that none of the 108 individuals were closely re- other (Omberg et al. 2012). There was also a strong partitioning lated, we used KING-robust (Manichaikul et al. 2010) and PREST- of the Chr Y haplogroups when considering the Q3 (African) plus (McPeek and Sun 2000) to estimate family relationships based Qataris, both when considering Q1 (Bedouin) versus Q3 on a set of 1,407,483 SNPs after pruning of the full set of − (African) (AMOVA P <1×10 5) and Q2 (Persian-South Asian) ver- 22,958,844 autosomal SNPs in Qatar (see Methods). Both analyses sus Q3 (African) (AMOVA P < 0.028). The Q3 (African) had largely identified five pairs of related individuals greater than third-degree African haplogroups, a result consistent with the known recent that were subsequently confirmed by investigative reassessment of African admixture of this subpopulation (Omberg et al. 2012). medical records (Supplemental Table III; Supplemental Fig. 3). The mtDNA haplogroups were less partitioned among the Three of the pairs form a trio; hence, two individuals from the Qataris, although they still showed significant partitioning be- trio were removed, and one individual from each of the two re- tween each pair of subpopulations (AMOVA Q1 versus Q2 P < maining pairs was removed, such that the remaining 104 individ- − 0.035, Q1 versus Q3 P <1×10 5, Q2 versus Q3 P < 0.017) and uals analyzed further included 8 Q0 (Subpopulation Unassigned) − among all three considered simultaneously (AMOVA P <1×10 5) and 96 Q1, Q2, or Q3 Qatari: 56 Q1 (Bedouin), 20 Q2 (Persian- (Supplemental Table VII). The mtDNA haplogroups also included South Asian), and 20 Q3 (African). more worldwide geographic diversity overall, indicating a different An analysis of inbreeding for these remaining individuals male versus female pattern of intermarriage among these subpop- showed the Q1 (Bedouin) to have a more positive inbreeding coef- ulations (Sandridge et al. 2010). Together the Chr Y and mtDNA ficient than most of the non-admixed 1000 Genomes (The 1000 haplogroups indicate that the Q1 (Bedouin), Q2 (Persian-South Genomes Project Consortium 2012) populations (Supplemental Asian), and Q3 (African) ancestry groups represent genetic subpop- Table IV; Supplemental Fig. 4), consistent with the known inbreed- ulations that not only reflect known migration history (Hunter- ing of this group (Hunter-Zinck et al. 2010; Omberg et al. 2012); al- Zinck et al. 2010; Omberg et al. 2012) but that also represent units though we also found the Q1 (Bedouin) to be less inbred than many defined by a patrilocal society with strong historical barriers to in- small and/or isolated populations worldwide represented in the termarriage (Esposito 2001; Cavalli-Sforza and Feldman 2003), in Human Origins samples (Lazaridis et al. 2014) (Supplemental which gene flow has been dominated by female movement (i.e., Table V; Supplemental Table VI; Supplemental Fig. 4). The Q2 admixture occurring through females marrying into the relatively (Persian-South Asian) had a positive, but slightly lower, inbreeding isolated subpopulations), as well as female influxes from other geo- coefficient than the Q1 (Bedouin). In contrast, the Q3 (African) had graphic areas. a non-negative coefficient that reflects known admixture with African populations (Hunter-Zinck et al. 2010; Omberg et al. 2012). X-linked and autosomal diversity We confirmed the primary ancestry classifications of the 104 Qataris by principal component analysis (Price et al. 2006). To further analyze the relative male and female contributions to We combined the 104 Qataris, the Human Origins populations the genetics of the Qatari Q1 (Bedouin), Q2 (Persian-South (Lazaridis et al. 2014), and 1000 Genomes populations (The Asian), and Q3 (African) subpopulations, we analyzed genome- 1000 Genomes Project Consortium 2012) (excluding individuals wide ratios of X-linked and autosomal (X/A) diversity and X/A already in Human Origins), and performed principal component diversity ratios for genome intervals >0.18 cM from genes analysis on a set of 197,714 linkage disequilibrium pruned auto- (Supplemental Table VIII; Supplemental Fig. 6). For both of these some SNPs (Fig. 1A; Supplemental Fig. 5A). We also confirmed ratios, the Q1 (Bedouin) and Q2 (Persian-South Asian) were lower these clusterings just with the 104 Qataris and 1000 Genomes sam- than for African populations but were higher than for Europeans ples based on the same set of autosomal SNPs (Supplemental Fig. and Asians. This points to a higher effective population size of fe- 5B). These analyses reproduced the population clustering observed males in the Q1 (Bedouin) and Q2 (Persian-South Asian), possibly previously (Hunter-Zinck et al. 2010; Omberg et al. 2012), with the a consequence of the out-of-Africa migrations, which were be- Q1 (Bedouin) closest to Europeans, the Q2 (Persian-South Asian) lieved to be biased toward migration of males over females between Q1 (Bedouin) and Asians, and the Q3 (African) closest (Gottipati et al. 2011; Arbiza et al. 2014). The Q3 (African)

152 Genome Research www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Uninterrupted ancestry in Arabia

A were slightly higher than when compar-

0.02 Study ing European to African populations 1000 Genomes (Gottipati et al. 2011; Arbiza et al. 2014). Human Origins 0.01 Qatari Genomes This could indicate a slightly less extreme Study, Population set of bottleneck events encountered 0.00 1000 Genomes, Africa 1000 Genomes, America since the out-of-Africa migrations by the 1000 Genomes, East Asia direct ancestors of the Q1 (Bedouin) and -0.01 1000 Genomes, Europe HumanOrigins,Africa Q2 (Persian-South Asian) compared to Human Origins, America -0.02 HumanOrigins,Central Asia Siberia the bottlenecks encountered by the direct Human Origins, East Asia ancestors of Europeans. The relative X/A Human Origins, PC1 -0.03 Human Origins, Oceania diversity ratios of Q3 (African) to African Human Origins, South Asia populations were closer to one, consis- -0.04 Human Origins, West Eurasia Qatari Genomes, Q0 (Unassigned) tent with the known African admixture Qatari Genomes, Q1 (Bedouin) of this subpopulation (Omberg et al. -0.05 Qatari Genomes, Q2 (Persian-South Asian) QatariGenomes,Q3(African) 2012). -0.06

-0.07 Pairwise sequential Markov coalescent

-0.03-0.02 -0.01 0.00 0.01 0.02 0.030.04 0.05 0.06 0.07 0.08 analysis B PC2 We next analyzed the full complement of autosomal polymorphisms for signals 0.007 Study Human Origins of ancient bottlenecks by applying the Qatari Genomes pairwise sequential Markov coalescent Population (PSMC) (Fig. 3; Li and Durbin 2011). 0.005 Bedouin A . Bedouin B This analysis showed that the Q1 Druze (Bedouin) and Q2 (Persian-South Asian) Egyptian Comas Egyptian Metspalu had clear hallmarks of a bottleneck 0.003 Iranian event, with effective population size hit- Jordanian Lebanese ting a trough in the range of 100,000

PC1 Palestinian to 30,000 yr ago with a minimum at Q0 (Unassigned) 0.001 Q1 (Bedouin) ∼60,000 yr ago. This same pattern is ob- Q2 (Persian-South Asian) served for a European individual from Q3 (African) Saudi the 1000 Genomes Project and is consis- Syrian -0.001 Turkish Adana tent with what has been observed in oth- Turkish Istanbul er non-African human genomes using Turkish Trabzon Yemen the pairwise sequential Markov coales- -0.003 cent, as well as related methods (Gronau et al. 2011; Fu et al. 2014; Schiffels and -0.020 -0.015 -0.010 -0.050 0.000 Durbin 2014). These data, therefore, PC2 point to the ancestors of Q1 (Bedouin) Figure 1. Principal component analysis (PCA) (Price et al. 2006) of the 104 Qatari genomes (circle), and Q2 (Persian-South Asian) as having 1000 Genomes (triangle), and Human Origins (square) study samples. Shown are individuals plotted migrated out of Africa at the same time on principal components PC1 and PC2, with genomes color-coded by study and population, with the as the ancestors of other non-African Q0 (Subpopulation Unassigned) in gray, Q1 (Bedouin) in red, Q2 (Persian-South Asian) in azure, and Q3 (African) in black. (A) Plot of all populations, defined by study and by population, in which all popu- populations (Henn et al. 2012). Al- lations from the same region and study are grouped and color-coded together (1000 Genomes: Africa, though PSMC estimates in the more re- America, East Asia, and Europe; Human Origins: Africa, America, Central Asia/Siberia, East Asia, Middle cent past tend to have larger confidence East, Oceania, South Asia, and West Eurasia). (B) Plot of Middle Eastern subpopulations from Human intervals (Li and Durbin 2011), the Q1 Origins that cluster near Q1 (Bedouin) and Q2 (Persian-South Asian). (Bedouin) do appear to have a lower pop- ulation size than the Q2 (Persian-South Asian) in the region <30,000 yr ago, con- Qataris had X/A diversity ratios that were higher, even when com- sistent with high levels of inbreeding in the Q1 (Bedouin) (Hunter- pared to African populations. This may be driven by a smaller male Zinck et al. 2010; Sandridge et al. 2010; Mezzavilla et al. 2015). For effective population size; a possible consequence of a polygamous the Q3 (African), the median effective population size was more culture and the ancestry of the Q3 (African) subpopulation that similar to an African individual from the 1000 Genomes Project was a result of the historical slave trade into the region from in the range 100,000 to 30,000 yr ago, consistent with Sub- Africa (Omberg et al. 2012). Saharan African ancestry that is relatively recent (Omberg et al. We also analyzed the relative ratios of X-linked and autosomal 2012). (X/A) diversity in nongenic regions of the female Q1 (Bedouin), Q2 (Persian-South Asian), and Q3 (African) genomes compared to females in African populations of the 1000 Genomes Project Admixture analysis (Supplemental Table IX). The relative X/A ratios of both the Q1 The signal of an ancient bottleneck in the Q1 (Bedouin) is not un- (Bedouin) and Q2 (Persian-South Asian) to African populations expected given previous analyses of genomic admixture that

Genome Research 153 www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Rodriguez-Flores et al.

A Y chromosome haplogroups age proportion (45%) and appeared to be more admixed overall. The Q2 (Persian- South Asian) shared a large proportion (45% on average) of ancestry that domi- nates in Iranians (46% on average), consistent with a Persian ancestral popu- lation (Omberg et al. 2012). The Q3 Q1 Q2 Q3 (African) shared the majority of ancestry (Bedouin) (Persian-South Asian) (African) with African populations as expected J1 Arabian E1 African R1 European J2 West Asian and were considerably admixed overall, T West Asian B2 African T2 West Asian again consistent with the known history L1 South Asian of this subpopulation (Supplemental Fig. 7A; Omberg et al. 2012). The ALDER analysis determined the B Mitochondrial DNA haplogroups relative percentage of African (Yoruba) ancestry in the Q1 (Bedouin) (2.6% ± 1.37) and Q2 (Persian-South Asian) (5.0% ± 1.41) at levels on par with esti- mates for other populations sampled in the region (Supplemental Fig. 8; Supple- mental Table X), including Human Q1 Q2 Q3 Origins Bedouin and Saudi. This con- (Bedouin) (Persian-South Asian) (African) firmed that recent African admixture is R0 Arabian L0 African I West Asian U1 West Asian M2 South Asian limited to the Q3 (African) subpopula- R2 Arabian L1 African I7 West Asian U2 West Asian M3 South Asian tion (37.6% ± 0.9), in which this estimate R6 Arabian L2 African J1 West Asian U4 European M5 South Asian is on par with African American popula- R8 Arabian L3 African J2 West Asian U5 European M52South Asian tions. An estimate of the timing of HV West Asian T1 West Asian U6 Mediterranean N1 West Asian H1 Middle Eastern T2 West Asian U9 South Asian African admixture placed the number of H2 West Asian X2West Asian generations for Q1 (Bedouin) (15.2) and H6 West Asian Q2 (Persian-South Asian) (14.0) slightly Figure 2. Y Chromosome (Chr Y) and mitochondrial DNA (mtDNA) haplogroup assignments. The Chr higher than Q3 (African) (9.3), consistent Y and mtDNA haplogroups were determined for Q1 (Bedouin), Q2 (Persian-South Asian), and Q3 with the Q1 (Bedouin) and Q2 (Persian- (African). (A) Pie charts of the haplogroup frequencies for Chr Y. (B) Pie charts of the haplogroup frequen- South Asian) reflecting more distant cies for mtDNA. African admixture events and with the Q3 (African) reflecting the historical tim- found <1% African ancestry in this subpopulation (Omberg et al. ing of the African slave trade in the region (Omberg et al. 2012). 2012) and studies of worldwide population structure, which The SupportMix analysis used six of the 1000 Genomes pop- have inferred that the Q1 (Bedouin) genomes have the greatest ulations (two European, two Asian, and two African) (see proportion of Arab genetic ancestry, even when compared to Supplemental Methods for details) as ancestral proxy reference Bedouins from outside Qatar and to Arabs in surrounding coun- panels and produced a set of “best guess” admixture assignments tries, including Yemen and Saudi Arabia (Hodgson et al. 2014; based on highest similarity to these genomes. Although these Shriner et al. 2014). To confirm a similarly minute amount of 1000 Genomes populations do not include appropriate local pop- African admixture for the Q1 (Bedouin) in our sample, we applied ulations most closely related to the Qataris needed for assessment three methodologies: (1) an ADMIXTURE (Alexander et al. 2009) of the true admixture composition of the genomes, the ancestry analysis of the genome-wide ancestry proportions in the 104 track length distribution of haplotypes assigned to African popula- Qataris, the 1000 Genomes Project (The 1000 Genomes Project tions (Yoruba or Luhuya) provides a qualitative indicator of wheth- Consortium 2012), and Human Origins samples (Lazaridis et al. er the subpopulations experienced recent admixture with African 2014); (2) an ALDER (Loh et al. 2013) analysis of the proportion populations. As expected, the track lengths of the Q1 (Bedouin) and timing of African ancestry in these same populations; and and Q2 (Persian-South Asian) assigned to African 1000 Genomes (3) a SupportMix (Omberg et al. 2012) analysis of the population populations were far shorter than those for Q3 (African) assignments of local genomic segments of the 96 Q1 (Bedouin), (Supplemental Fig. 9), again confirming that recent African admix- Q2 (Persian-South Asian), or Q3 (African) Qatari genomes. ture is limited to the Q3 (African) subpopulation. The ADMIXTURE analysis identified K = 12 ancestral popula- tions as having the lowest cross-validation error (Supplemental Fig. 7A). At this level of resolution, the Q1 (Bedouin) had a high av- Neanderthal ancestry erage (84%) proportion of ancestry that was also present in the We next analyzed Neanderthal admixture contributions to the an- Human Origins Bedouin B population at a high average propor- cestry of Q1 (Bedouin) compared to the Q2 (Persian-South Asian) tion (93%) (Supplemental Fig. 7B,C), in which this same ancestry and Q3 (African) Qataris, the 1000 Genomes populations, and was also shared with Saudis, and at lower levels among other the populations of the Human Origins samples using the F4 ratio Middle Eastern populations. This ancestry therefore appears to and Patterson’s D-statistic (Fig. 4; Supplemental Fig. 10, Supple- be the signal of an indigenous Arab ancestral population. The mental Table XI; Patterson et al. 2012). The results for both meth- Bedouin A population also shared this ancestry but at a lower aver- ods were highly correlated (Supplemental Fig. 10A). The Q1

154 Genome Research www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Uninterrupted ancestry in Arabia

Figure 3. Ancient bottlenecks in the 96 Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) Qatari genomes (56 Q1, 20 Q2, 20 Q3) determined by pairwise sequential Markov coalescent analysis (Li and Durbin 2011). Shown is the plot of the median effective population size (y-axis) across individuals in a subpopulation versus years in the past (log scale x-axis) for the samples in the three major Qatari subpopulations: Q1 (Bedouin) in red, Q2 (Persian-South Asian) in azure, Q3 (African) in black. A single individual of European ancestry (NA12879, violet) and a single individual of African ancestry (NA19239, orange) from the 1000 Genomes Project deep-coverage pilot (The 1000 Genomes Project Consortium 2010) are shown for comparison.

(Bedouin; F4 ratio = 0.026, D-statistic = 0.000) had more Neander- TreeMix analysis thal admixture than all African populations, including Q3 We also analyzed the autosomes of the combined 96 Q1 (African; F ratio range = −0.017 to 0.024, D-statistic range = 4 (Bedouin), Q2 (Perisan-South Asian) or Q3 (African) Qataris, and −0.031 to −0.003). The Q1 (Bedouin) also had Neanderthal admix- non-admixed populations of the 1000 Genomes Project using ture at levels comparable to Q2 (Persian-South Asian; F ratio = 4 the population split and mixture inference method TreeMix 0.024, D-statistic = −0.003) and to other Middle Eastern popula- (Pickrell and Pritchard 2012) to assess the relative genetic similar- tions, including other Bedouin populations (Human Origins ity of populations based on high-density, genome-wide allele Bedouin A F ratio = 0.022, D-statistic = −0.003 and Bedouin B F 4 4 frequencies. The analysis returned an overall tree for the 1000 ratio = 0.024, D-statistic = −0.003) and Saudi (F ratio = 0.026, D- 4 Genomes populations that mirrored those found previously statistic = −0.001). Interestingly, the Q1 (Bedouin) did not tend (Shriner et al. 2014) with the addition of the Q1 (Bedouin) and to have higher Neanderthal admixture levels when considering Q2 (Persian-South Asian) clustering on the branch that includes populations outside of the Middle East, where the bulk of Europeans (Pérez-Miranda et al. 2006) and the Q3 (African) clus- European populations had higher Neanderthal admixture (F ratio 4 tering with African populations (Fig. 5). When migrations were al- range = 0.018 to 0.041, D-statistic range = 0.003 to 0.010). Yet, the lowed in the analysis, no migration events were observed between percentage of Neandethal admixture with the Q1 (Bedouin) was the Q1 (Bedouin) and African populations, even when allowing as higher than expected if it could be entirely explained by later ad- many as five migration events (Supplemental Fig. 11). These re- mixture events between the Q1 (Bedouin) and Europeans (ob- sults are also consistent with what is known of the migration his- served F ratio = 0.026 versus expected F ratio = 0.00247). 4 4 tory of the Arabian Peninsula, including migration both to and The higher Neanderthal ancestry in the Q1 (Bedouin) Qatari from Europe during ancient and more recent eras of civilization, compared to African populations places the divergence of an- where this resulted in detectable admixture from European popu- cestral Arabs after the out-of-Africa bottleneck. Given the lations in both the Q1 (Bedouin) and Q2 (Persian-South Asian) current evidence of the geographic range of Neanderthal popula- (Omberg et al. 2012). tions stretching from Europe and the Mediterranean through Northern and Central Asia (Fu et al. 2014; Hershkovitz et al. 2015), the lower Neanderthal Ancestry in the Q1 (Bedouin) Proportion of shared alleles neighbor-joining analysis Qatari compared to populations within the ancestral Neanderthal As the principal component analysis and the TreeMix population- range is also consistent with an early divergence of the ancestors level clusterings depend on allele frequencies, the clustering of the of indigenous Arabs from other lineages that populated Asia Q1 (Bedouin) on a common branch with European populations and Europe. Yet, since the Neanderthal admixture in the Q1 could be driven by the haplotypes introduced by migrants, which (Bedouin) cannot be entirely explained by admixture with would be expected to shift the allele frequencies of these popula- Europeans, this indicates there was some admixture between tions toward each other. As such, these clusterings based on allele and ancestors of the Q1 (Bedouin) in the region of frequencies do not necessarily argue against significant and the Arabian Peninsula. deep ancestry of the Q1 (Bedouin) on the Arabian Peninsula,

Genome Research 155 www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Rodriguez-Flores et al.

Neanderthal ancestry proportion In contrast to population-level clus- -0.020 0.110 0.084 0.033 0.058 0.006 tering, a pairwise clustering of individual genomes based on shared variants pro- vides a relative measure for comparing total shared ancestry between individu- als. Also, when applied to a common Bantu SA Zulu Study, Population set of genome-wide, high-density mark- LWK 1000 Genomes, Africa ers that include the low-minor allele fre- ASW 1000 Genomes, America Q3 (African) 1000 Genomes, East Asia quency alleles of the 1000 Genomes Egyptian Metspalu 1000 Genomes, Europe Project, such pairwise clustering also pro- Bolivian Cochabamba Human Origins, Africa vides an appropriate weight to rare Chukchi Sir Human Origins, America Yemen HumanOrigins,Central Asia Siberia alleles. We therefore performed a propor- Spanish Canarias IBS Human Origins, East Asia tion of shared alleles (Mountain and Gujarati A GIH Human Origins, Middle East Cavalli-Sforza 1997) analysis on the com- Syrian Human Origins, Oceania bined samples in the 104 Qatari and the Iranian Human Origins, South Asia Oroqen Human Origins, West Eurasia 1000 Genomes samples, in which pair- Turkish Trabzon Qatari Genomes, Q1 (Bedouin) wise proportion of shared alleles was cal- Jordanian Qatari Genomes, Q2 (Persian-South Asian) culated for the 11,711,386 autosomal, Qatari Genomes, Q3 (African) Bedouin B biallelic SNPs segregating in both the Druze 104 Qatari and the 1000 Genomes sam- Palestinian IBS ples. A robust version of the neighbor- Bedouin A joining algorithm was used to perform a Bantu SA Ovambo pairwise clustering of the samples (Fig. Q2 (Persian-South Asian) 6A–F; Criscuolo and Gascuel 2008), in Egyptian Comas eanderthal ancestry proportion (F4 ratio) ancestry eanderthal Turkish Balikesir which bootstrap support values were cal- MXL culated for the observed trees using 100 Turkish random samplings of the SNPs. Q1 (Bedouin) The neighbor-joining analysis re- Turkish Kayseri Saudi vealed that 50 of the 56 Q1 (Bedouin), Lebanese along with three Q2 (Persian-South Turkish Istanbul Asian), one Q3 (African), and two Q0 CLM (Subpopulation Unassigned) Qataris, Turkish Aydin clustered outside African lineages and Turkish Adana FIN were also the most extreme outgroup JPT that are basal to all non-African popula- CHB tions lacking recent African admixture Kusunda (Fig. 6D). Strong bootstrap support was Karitiana Mansi observed for this cluster (70 of 100 itera-

Population, in order of increasing N of increasing Population, in order Italian East Sicilian tions), and for its presence as an outgroup Naxi to the Eurasian cluster (68 of 100 itera- Bougainville tions), comparable to the support for the Australian ECCAC Japanese cluster (60 of 100 iterations) and for the East Asians as an outgroup to Figure 4. Neanderthal ancestry in world populations. F4 ratio estimation as implemented in ADMIXTOOLS 3.0 (Patterson et al. 2012) was used to calculate the Neanderthal ancestry proportion Europeans and Americans (81 of 100 iter- for each population in the combined data set of Qatari genomes, the 1000 Genomes Project, and ations).TheQ1(Bedouin)thereforefit the Human Origins. The F4 ratio estimates α, the proportion of Neanderthal ancestry in a population. criteria of having ancient migration from Shown are the results for populations of interest, including highest and lowest scoring populations Africa and being most distantly related to from each region (the 1000 Genomes Project, Africa; the 1000 Genomes Project, America; the 1000 all other non-Africans in total ancestry. Genomes Project, East Asia, the 1000 Genomes Project, Europe, Human Origins, Africa; Human Origins, America; Human Origins, Central Asia/Siberia; Human Origins, East Asia; Human Origins, A total of 11 Q2 (Persian-South Oceania; Human Origins, South Asia; Human Origins, West Eurasia), Middle Eastern populations Asian), three Q1 (Bedouin), one Q3 (Human Origins), Q1 (Bedouin), Q2 (Persian-South Asian) and Q3 (African). Populations are color-coded (African), and one Q0 (Subpopulation by region, and a distinct color is used for each Qatari population. A full set of results is presented in Unassigned) defined an Asian outgroup Supplemental Figure 10 and Supplemental Table XI. The population codes are as in the 1000 Genomes Project (The 1000 Genomes Project Consortium 2012). more closely related to Asians than the main Q1 (Bedouin) outgroup (Fig. 6C), likely driven by the ancestry of the the as indicated by the levels of Neanderthal admixture in this subpo- Q2 (Persian-South Asian) subpopulation traceable to Persia and pulation. Additionally, these population-level clusterings are dis- South Asia (Omberg et al. 2012) and indicating these individuals proportionately influenced by common segregating alleles are most distantly related to other Asians present in this cluster. (Pickrell and Pritchard 2012), while rare alleles can be more infor- A total of 12 Q3 (African), three Q1 (Bedouin), three Q2 (Persian- mative about deeper shared ancestry (Mathieson and McVean South Asian), and four Q0 (Subpopulation Unassigned) cluster as 2014) as the identity by state of a rare variant can more accurately long individual branches or small clusters between the major Q1 reflect identity by descent (Hochreiter 2013). (Bedouin) cluster and the admixed individuals of African ancestry

156 Genome Research www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Uninterrupted ancestry in Arabia

Yoruba Nigerian (YRI) of the autochthonous population on the Arabian Peninsula, these

Luhuya Kenyan (LWK) two conclusions therefore point to the Bedouins being direct de- scendants of the earliest split after the out-of-Africa migration African (Q3) events that established a basal Eurasian population (Lazaridis Iberian Spanish (IBS) et al. 2014). This is also consistent with the majority of Q1 Finnish (FIN) (Bedouin) being able to trace a significant portion of their autoso-

British (GBR) mal ancestry through lineages that never left the peninsula after the out-of-Africa migration events since such deep ancestry would Utahns (CEU) not be expected if the entire Arabian Peninsula population had Tuscan Italian (TSI) been reestablished from Africa or a non-African population at a lat- Bedouin (Q1) er point.

Persian-South Asian (Q2) Given the complex history of migration patterns to and from European populations, and the complicated patterns of isolation Japanese (JPT) and intra- and inter-marriage of the indigenous Bedouin popula- Han Chinese Beijing (CHB) tions (Hunter-Zinck et al. 2010; Sandridge et al. 2010), it is not sur- Han Chinese South (CHS) prising that among the Q1 (Bedouin) are individuals who retain an 0.002 autosomal signal of being the most distant relatives of non- Figure 5. TreeMix (Pickrell and Pritchard 2012) hierarchical clustering Africans, while population-level clustering based on migration- analysis of the Q1 (Bedouin), Q2 (Persian-South Asian), and Q3 (African) shifted allele frequencies places the Q1 (Bedouin) closer to and the 1000 Genomes Project samples. Shown is a maximum-likelihood Europeans. The basal position of the Q1 (Bedouin) also has inter- tree of population splits inferred without subsequent migration events, in esting implications for theories about the frequency, timing, and which branch lengths estimate divergence between populations path of major migration waves that established populations in (Europeans in shades of purple: CEU, FIN, GBR, IBS, TSI; East Asians in shades of brown: CHB, CHS, JPT; Africans in shades of orange: LWK, YRI, Asia and Europe (Shi et al. 2008; Lazaridis et al. 2014; Shriner with the Q1 [Bedouin] in red, Q2 [Persian-South Asian] in azure, and Q3 et al. 2014). A few isolated Asian populations were previously sus- [African] in black). When allowing from one to five migration events in sep- pected to be descendants of a separate out-of-Africa migration arate TreeMix analyses, none of the admixture loops connected the Q1 wave based on Y Chromosome data (Hammer et al. 1998; Shi (Bedouin) with any African populations (Supplemental Fig. 10), consistent with the Q1 (Bedouin) having no recent African admixture. et al. 2008). Yet, distinct out-of-Africa migration events or separate migration waves emanating from the Arabian Peninsula into Europe and West Asia would be expected to place Bedouins/ from Southwest US (ASW), potentially representing individuals Europeans and Asians on separate branches of a pairwise clustering with a higher proportion of African admixture. As expected from tree, distinct from our finding that places the Q1 (Bedouin) as di- the analyses of population genetic similarity and prior neighbor- rect descendants of the earliest lineage that split from the ancient joining analysis of admixed populations (Kopelman et al. 2013), non-African population. the Q3 (African) and African Americans do not form large clusters, A demographic scenario consistent with the evidence pre- but rather appear as multiple individual branches close to the in- sented here is that the population ancestral to the Q1 (Bedouin) digenous African populations, most similar to their African admix- migrated out of Africa, and a subset of this population remained ture source (Fig. 6E,F). A set of three Q2 (Persian-South Asian) in the peninsula until the present day, while a second subset of clustered as an outgroup to the Tuscan Southern European (TSI) this population migrated onward and colonized Eurasia. This mi- branch (Fig. 6B), which is not unexpected given admixture with gration scenario implies the signal of the same bottleneck would European populations (Omberg et al. 2012; Pickrell et al. 2014). be present in all non-African populations, which has been ob- served thus far in coalescent analysis of contemporary non- Discussion African populations (Gronau et al. 2011; Fu et al. 2014; Schiffels and Durbin 2014) and for an anatomically modern human who The hypothesis that the first Eurasian populations were estab- lived 45,000 yr ago (Fu et al. 2014). This is also consistent with lished on the Arabian Peninsula and that contemporary indige- the recent discovery of another anatomically modern human nous Arabs are direct descendants of this ancient population is who lived 55,000 yr ago just northeast of the Arabian Peninsula supported by two major conclusions derived from the combined that had morphological features similar to European peoples evidence of this study. First, the analysis results for X/A diversity, (Hershkovitz et al. 2015), where this individual could have been the pairwise sequential Markov coalescent, genome-wide admix- a descendant of the basal Eurasian population that remained on ture, timing of African admixture, local admixture deconvolution, the peninsula. Under this migration scenario, although other Neanderthal admixture, and application of TreeMix, support the waves of migration may have occurred, the descendants of these inference that the Q1 (Bedouin) can trace the bulk of their ancestry alternative waves either left no descendants or were integrated back to the out-of-Africa migration events. Second, the com- into the dominant populations. bination of lower levels of Neanderthal admixture in the Q1 Beyond the importance for disentangling human migration (Bedouin) than European/Asian populations and the outgroup po- history, an early split of Eurasian lineages in the Arabian sition of the Q1 (Bedouin) compared to non-Africans in the Peninsula has implications for the study of disease genetics for in- pairwaise similarity clustering of high-density variants measured digenous people in the region. For example, for a disease such as genome-wide, place the Q1 (Bedouin) as being the most distant type 2 diabetes that has a prevalence of >18% in the Qatari popu- relatives of other contemporary non-Africans. Given that the Q1 lation, associated genetic variants would not a priori be expected (Bedouin) have the greatest proportion of Arab genetic ancestry to be the same as those discovered in Europeans, when considering measured in contemporary populations (Hodgson et al. 2014; that indigenous Arabs are able to trace a significant portion of their Shriner et al. 2014) and are among the best genetic representatives ancestry back to ancient lineages on the Arabian Peninsula. More

Genome Research 157 www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Rodriguez-Flores et al.

CEU_NA12154 CEU_NA12751 GBR_HG00136 CEU_NA11829 CEU_NA11892 CEU_NA12777 CEU_NA12043 CEU_NA12058 GBR_HG00252 CEU_NA12340 GBR_HG00232 GBR_HG00253 GBR_HG00258 GBR_HG00262 GBR_HG00103 CEU_NA11831 GBR_HG00104 tant contributions to disease risk in these A GBR_HG00106 CEU_NA12842 CEU_NA12874 GBR_HG00141 GBR_HG00126 CEU_NA12775 CEU_NA12004 CEU_NA12399 CEU_NA11920 CEU_NA12829 CEU_NA11932 B CEU_NA12383 CEU_NA12341 CEU_NA12717 CEU_NA11992 GBR_HG00108 British and Utahn GBR_HG00154 GBR_HG00160 CEU_NA07056 CEU_NA12827 CEU_NA10851 CEU_NA12750 CEU_NA12716 CEU_NA12045 CEU_NA12046 CEU_NA07048 FIN_HG00173 FIN_HG00319 FIN_HG00269 FIN_HG00361 FIN_HG00177 FIN_HG00271 populations are yet to be discovered. FIN_HG00284 FIN_HG00335 FIN_HG00185 FIN_HG00343 FIN_HG00366 FIN_HG00278 FIN_HG00342 FIN_HG00367 (GBR, CEU) FIN_HG00373 FIN_HG00176 FIN_HG00377 FIN_HG00182 FIN_HG00272 FIN_HG00324 FIN_HG00325 FIN_HG00313 FIN_HG00338 FIN_HG00369 FIN_HG00282 FIN_HG00285 FIN_HG00323 FIN_HG00383 FIN_HG00281 FIN_HG00266 FIN_HG00336 FIN_HG00353 FIN_HG00309 FIN_HG00328 FIN_HG00171 FIN_HG00326 FIN_HG00276 FIN_HG00321 This study is the first analysis of FIN_HG00186 FIN_HG00381 FIN_HG00178 FIN_HG00280 FIN_HG00320 FIN_HG00372 FIN_HG00179 FIN_HG00187 FIN_HG00274 FIN_HG00273 FIN_HG00267 FIN_HG00327 FIN_HG00189 FIN_HG00330 FIN_HG00357 FIN_HG00346 FIN_HG00376 FIN_HG00318 FIN_HG00334 FIN_HG00349 FIN_HG00355 FIN_HG00384 FIN_HG00359 FIN_HG00360 Finnish (FIN) FIN_HG00358 FIN_HG00382 FIN_HG00356 FIN_HG00362 FIN_HG00364 FIN_HG00351 FIN_HG00378 FIN_HG00180 Arabian Peninsula migration making FIN_HG00332 FIN_HG00345 FIN_HG00375 FIN_HG00310 FIN_HG00174 FIN_HG00315 FIN_HG00329 FIN_HG00183 C FIN_HG00339 FIN_HG00188 FIN_HG00312 FIN_HG00270 FIN_HG00331 FIN_HG00268 FIN_HG00275 FIN_HG00337 FIN_HG00344 FIN_HG00277 FIN_HG00306 FIN_HG00190 FIN_HG00311 FIN_HG00341 FIN_HG00350 CEU_NA11930 CEU_NA12003 GBR_HG00158 GBR_HG00256 GBR_HG00143 GBR_HG00250 GBR_HG00261 CEU_NA11995 CEU_NA11931 use of deeply sequenced genomes from CEU_NA11933 CEU_NA12748 GBR_HG00260 CEU_NA12763 GBR_HG00131 GBR_HG00133 CEU_NA06989 CEU_NA12155 GBR_HG00233 CEU_NA11919 CEU_NA12144 GBR_HG00231 GBR_HG01334 GBR_HG00146 GBR_HG00148 GBR_HG00125 GBR_HG00259 CEU_NA07000 CEU_NA11893 CEU_NA12749 GBR_HG00236 GBR_HG00264 GBR_HG00151 CEU_NA12778 CEU_NA11830 CEU_NA12812 CEU_NA12814 GBR_HG00139 GBR_HG00150 CEU_NA07051 GBR_HG00245 CEU_NA07037 a sample of unrelated inhabitants of the CEU_NA12830 GBR_HG00130 IBS_HG01618 GBR_HG00097 IBS_HG01624 GBR_HG00135 GBR_HG00255 IBS_HG01625 IBS_HG01626 IBS_HG01619 IBS_HG01617 IBS_HG01623 IBS_HG01620 GBR_HG00129 GBR_HG00235 GBR_HG00128 GBR_HG00113 GBR_HG00234 GBR_HG00238 GBR_HG00240 GBR_HG00102 GBR_HG00121 GBR_HG00099 GBR_HG00109 GBR_HG00100 GBR_HG00101 GBR_HG00116 GBR_HG00120 GBR_HG00114 GBR_HG00096 GBR_HG00110 CEU_NA12272 peninsula. Although there have been GBR_HG00118 GBR_HG00263 GBR_HG00156 CEU_NA12273 GBR_HG00112 D GBR_HG00123 GBR_HG00119 GBR_HG00124 British and Utahn GBR_HG00117 GBR_HG00257 GBR_HG00122 GBR_HG00140 CEU_NA12889 CEU_NA12275 CEU_NA11843 CEU_NA12347 GBR_HG00111 CEU_NA12873 CEU_NA06984 GBR_HG00138 GBR_HG00159 CEU_NA06994 CEU_NA12249 CEU_NA12546 CEU_NA12282 GBR_HG00149 GBR_HG00243 CEU_NA12489 CEU_NA12006 CEU_NA12044 GBR_HG00134 (GBR, CEU) GBR_HG00142 many analyses of Chr Y and mtDNA sam- GBR_HG00242 GBR_HG00244 GBR_HG00246 CEU_NA12890 CEU_NA12286 CEU_NA12342 GBR_HG00249 GBR_HG00265 CEU_NA12761 GBR_HG00239 GBR_HG00254 CEU_NA12815 CEU_NA07347 CEU_NA12718 GBR_HG00127 CEU_NA06986 GBR_HG00137 CEU_NA12843 CEU_NA11894 CEU_NA11993 GBR_HG00152 CEU_NA12287 CEU_NA12283 CEU_NA12348 GBR_HG00237 GBR_HG00251 GBR_HG00247 CEU_NA10847 CEU_NA11994 CEU_NA12400 CEU_NA07357 pled from Arab individuals (Abu-Amero GBR_HG00155 CEU_NA12413 CEU_NA12872 TSI_NA20803 MXL_NA19675 MXL_NA19679 MXL_NA19678 MXL_NA19648 MXL_NA19676 MXL_NA19725 MXL_NA19652 MXL_NA19651 MXL_NA19779 MXL_NA19684 MXL_NA19750 CLM_HG01383 CLM_HG01465 CLM_HG01550 CLM_HG01374 CLM_HG01251 CLM_HG01113 CLM_HG01351 CLM_HG01357 CLM_HG01378 MXL_NA19719 CLM_HG01149 CLM_HG01350 CLM_HG01495 CLM_HG01456 CLM_HG01455 CLM_HG01494 CLM_HG01497 et al. 2007, 2008, 2009; Rowold et al. CLM_HG01498 MXL_NA19716 MXL_NA19761 MXL_NA19731 MXL_NA19732 MXL_NA19729 MXL_NA19728 MXL_NA19735 MXL_NA19741 MXL_NA19740 MXL_NA19758 MXL_NA19759 MXL_NA19783 MXL_NA19746 MXL_NA19785 MXL_NA19755 MXL_NA19720 MXL_NA19788 MXL_NA19663 ASW_NA20314 MXL_NA19717 MXL_NA19762 MXL_NA19681 MXL_NA19734 MXL_NA19657 MXL_NA19753 MXL_NA19786 MXL_NA19654 MXL_NA19747 MXL_NA19771 MXL_NA19726 MXL_NA19738 2007), and there have been previous sur- MXL_NA19723 Colombian and Mexican MXL_NA19660 MXL_NA19672 MXL_NA19664 MXL_NA19661 MXL_NA19685 MXL_NA19752 MXL_NA19774 MXL_NA19777 MXL_NA19756 MXL_NA19776 MXL_NA19722 MXL_NA19782 MXL_NA19682 MXL_NA19789 MXL_NA19764 MXL_NA19795 CLM_HG01441 MXL_NA19770 MXL_NA19780 MXL_NA19749 MXL_NA19773 CLM_HG01137 CLM_HG01377 CLM_HG01250 (CLM, MXL) CLM_HG01356 CLM_HG01136 CLM_HG01375 MXL_NA19655 ASW_NA19625 ASW_NA20414 ASW_NA20299 veys of genetic variation of people within CLM_HG01551 CLM_HG01462 CLM_HG01342 CLM_HG01390 CLM_HG01461 CLM_HG01365 CLM_HG01440 CLM_HG01124 CLM_HG01366 CLM_HG01277 CLM_HG01278 CLM_HG01272 CLM_HG01344 CLM_HG01274 MXL_NA19737 CLM_HG01271 CLM_HG01257 CLM_HG01140 CLM_HG01360 CLM_HG01125 CLM_HG01384 CLM_HG01134 CLM_HG01148 CLM_HG01359 CLM_HG01345 MXL_NA19794 CLM_HG01354 CLM_HG01489 CLM_HG01488 CLM_HG01133 CLM_HG01353 CLM_HG01491 the peninsula and immediately sur- CLM_HG01389 CLM_HG01492 CLM_HG01112 CLM_HG01437 CLM_HG01259 CLM_HG01275 PUR_HG00640 PUR_HG01048 PUR_HG01060 PUR_HG01174 PUR_HG00732 PUR_HG01080 PUR_HG00731 PUR_HG01187 E PUR_HG01191 PUR_HG01097 PUR_HG00734 PUR_HG01067 PUR_HG01183 PUR_HG01069 PUR_HG00641 PUR_HG01073 PUR_HG01197 PUR_HG01055 PUR_HG00740 PUR_HG01066 PUR_HG01061 PUR_HG01070 PUR_HG01095 PUR_HG00638 PUR_HG01198 PUR_HG01176 rounding regions conducted with geno- PUR_HG01204 PUR_HG01101 PUR_HG01188 PUR_HG01167 PUR_HG00553 PUR_HG01052 PUR_HG01051 PUR_HG01190 PUR_HG01107 Puerto Rican (PUR) PUR_HG01082 PUR_HG01102 PUR_HG01171 PUR_HG00637 PUR_HG01083 PUR_HG01170 PUR_HG00554 PUR_HG01072 PUR_HG01098 PUR_HG00736 PUR_HG01079 PUR_HG00737 PUR_HG01047 PUR_HG01104 PUR_HG01085 PUR_HG01105 PUR_HG01173 PUR_HG01075 IBS_HG01519 IBS_HG01521 IBS_HG01515 IBS_HG01522 IBS_HG01516 typing arrays (Behar et al. 2010; Hunter- IBS_HG01518 TSI_NA20524 TSI_NA20530 TSI_NA20507 TSI_NA20785 TSI_NA20502 TSI_NA20585 Iberian Spanish (IBS) TSI_NA20534 TSI_NA20543 TSI_NA20792 TSI_NA20796 TSI_NA20517 TSI_NA20537 TSI_NA20505 TSI_NA20753 TSI_NA20525 TSI_NA20757 TSI_NA20815 TSI_NA20544 TSI_NA20801 TSI_NA20522 TSI_NA20814 TSI_NA20783 L TSI_NA20798 TSI_NA20586 TSI_NA20775 L TSI_NA20778 TSI_NA20813 TSI_NA20755 A TSI_NA20773 TSI_NA20800 TSI_NA20807 Zinck et al. 2010; Alsmadi et al. 2013; Q TSI_NA20808 TSI_NA20588 TSI_NA20754 Y TSI_NA20508 TSI_NA20513 TSI_NA20805 Y TSI_NA20536 TSI_NA20510 TSI_NA20761 TSI_NA20760 TSI_NA20804 TSI_NA20786 TSI_NA20811 TSI_NA20535 TSI_NA20812 TSI_NA20512 F TSI_NA20802 TSI_NA20519 TSI_NA20540 TSI_NA20504 TSI_NA20527 TSI_NA20810 TSI_NA20582 TSI_NA20772 TSI_NA20515 TSI_NA20799 TSI_NA20765 TSI_NA20521 TSI_NA20581 TSI_NA20509 TSI_NA20529 TSI_NA20516 Markus et al. 2014; Shriner et al. 2014) TSI_NA20816 Tuscan Italian (TSI) TSI_NA20538 TSI_NA20795 TSI_NA20532 TSI_NA20806 TSI_NA20518 TSI_NA20539 TSI_NA20542 TSI_NA20790 TSI_NA20774 TSI_NA20818 TSI_NA20506 TSI_NA20828 TSI_NA20759 TSI_NA20771 TSI_NA20809 TSI_NA20797 TSI_NA20826 TSI_NA20589 TSI_NA20756 TSI_NA20541 TSI_NA20768 TSI_NA20503 TSI_NA20520 TSI_NA20758 TSI_NA20531 TSI_NA20770 TSI_NA20528 TSI_NA20533 TSI_NA20787 TSI_NA20752 TSI_NA20769 and deep exome sequencing (Rodri- TSI_NA20819 PUR_HG01168 TSI_NA20766 Q2_DGMQ32056 Q2_DGMQ31269 Q2_DGMQ31611 CHB_NA18564 CHB_NA18592 CHB_NA18561 CHS_HG00611 CHB_NA18563 CHB_NA18619 B CHB_NA18618 CHB_NA18567 Persian-South Asian Qataris (Q2) CHB_NA18611 CHB_NA18597 CHB_NA18553 CHB_NA18559 CHS_HG00592 CHB_NA18548 CHB_NA18541 CHB_NA18599 CHB_NA18635 CHB_NA18539 CHB_NA18546 CHB_NA18612 CHB_NA18547 CHB_NA18572 CHB_NA18602 CHB_NA18595 CHB_NA18633 CHB_NA18566 guez-Flores et al. 2012, 2014; Alsmadi CHB_NA18596 CHB_NA18579 CHB_NA18613 CHB_NA18544 CHB_NA18610 CHB_NA18571 CHB_NA18538 CHB_NA18535 CHB_NA18615 CHB_NA18562 CHB_NA18620 CHB_NA18593 CHB_NA18558 CHB_NA18542 CHB_NA18557 CHB_NA18621 CHB_NA18638 CHB_NA18537 CHB_NA18622 CHB_NA18623 CHB_NA18609 CHS_HG00628 CHB_NA18560 CHS_HG00422 CHS_HG00590 CHS_HG00530 CHS_HG00620 CHS_HG00629 CHS_HG00625 CHB_NA18536 CHB_NA18630 et al. 2014), and by individual high-cov- CHS_HG00580 CHB_NA18555 CHS_HG00404 CHS_HG00654 CHS_HG00684 CHS_HG00457 CHS_HG00476 CHS_HG00543 CHS_HG00619 CHS_HG00556 CHS_HG00634 CHB_NA18616 CHS_HG00689 CHS_HG00708 CHS_HG00613 CHS_HG00577 CHS_HG00584 CHS_HG00595 CHB_NA18614 CHS_HG00513 CHS_HG00560 CHS_HG00701 CHS_HG00566 CHS_HG00589 CHS_HG00583 CHB_NA18636 CHS_HG00705 CHB_NA18582 CHS_HG00663 CHS_HG00671 CHS_HG00672 CHS_HG00698 erage genomes (Alsmadi et al. 2014; CHS_HG00653 CHS_HG00693 CHS_HG00463 CHS_HG00478 CHS_HG00704 CHS_HG00707 CHS_HG00650 CHS_HG00683 CHS_HG00657 CHS_HG00702 CHS_HG00656 CHS_HG00662 CHS_HG00565 CHB_NA18526 CHS_HG00699 CHS_HG00690 CHB_NA18565 CHS_HG00421 CHS_HG00451 CHS_HG00418 CHS_HG00427 CHB_NA18631 Han Chinese (CHB, CHS) CHB_NA18637 CHS_HG00464 CHB_NA18549 CHS_HG00593 CHS_HG00403 CHS_HG00531 CHS_HG00614 CHS_HG00651 CHS_HG00449 CHB_NA18634 John et al. 2015), the sample of rare and CHS_HG00419 CHS_HG00610 CHS_HG00437 CHS_HG00537 CHS_HG00626 CHB_NA18606 CHS_HG00475 CHS_HG00542 CHS_HG00500 CHS_HG00536 CHS_HG00472 CHB_NA18577 CHS_HG00436 CHS_HG00473 CHS_HG00443 CHS_HG00448 CHB_NA18570 CHB_NA18632 CHS_HG00557 CHS_HG00596 CHB_NA18617 CHB_NA18534 CHB_NA18552 CHB_NA18624 CHB_NA18605 CHS_HG00578 CHS_HG00581 CHS_HG00635 CHS_HG00607 CHS_HG00608 CHB_NA18545 CHB_NA18627 common genetic variation throughout CHS_HG00479 CHS_HG00534 CHB_NA18574 CHB_NA18603 CHB_NA18532 CHB_NA18543 CHB_NA18573 CHS_HG00428 CHS_HG00559 CHB_NA18628 CHS_HG00512 CHS_HG00524 CHS_HG00501 CHS_HG00692 CHS_HG00458 CHS_HG00533 CHS_HG00525 CHB_NA18530 CHB_NA18550 CHB_NA18576 British (GBR) CHB_NA18608 CHB_NA18639 CHB_NA18645 CHS_HG00407 CHB_NA18525 CHB_NA18748 CHS_HG00406 CHS_HG00445 CHS_HG00442 CHS_HG00446 CHS_HG00452 CHB_NA18643 the genome in our sample provides a far CHB_NA18757 CHB_NA18640 CHB_NA18749 CHB_NA18745 CHB_NA18747 CHB_NA18527 CHB_NA18740 CHB_NA18528 CHB_NA18641 CHB_NA18647 CHB_NA18642 JPT_NA19058 JPT_NA19078 JPT_NA19055 JPT_NA19083 JPT_NA18945 Utahn (CEU) JPT_NA19003 JPT_NA19076 JPT_NA18951 JPT_NA18990 JPT_NA19062 JPT_NA18974 JPT_NA18989 JPT_NA19054 JPT_NA19075 JPT_NA18983 JPT_NA19087 JPT_NA18963 JPT_NA19067 JPT_NA19002 JPT_NA18947 JPT_NA19081 more complete picture of how both an- JPT_NA19063 JPT_NA19064 JPT_NA18964 JPT_NA19004 JPT_NA18986 JPT_NA18946 JPT_NA18948 JPT_NA18975 JPT_NA18971 JPT_NA19000 JPT_NA18988 JPT_NA18952 Finn (FIN) JPT_NA18956 JPT_NA18943 JPT_NA18968 JPT_NA19060 JPT_NA18950 JPT_NA18985 JPT_NA18982 JPT_NA19074 JPT_NA18984 JPT_NA18949 JPT_NA19080 JPT_NA19059 JPT_NA19079 JPT_NA19065 JPT_NA19088 JPT_NA19007 JPT_NA18941 JPT_NA18960 JPT_NA19070 JPT_NA18973 cient and recent migration events have JPT_NA19056 JPT_NA18940 JPT_NA18959 Japanese (JPT) JPT_NA18965 JPT_NA18942 JPT_NA18976 JPT_NA19005 JPT_NA19068 Colombian (CLM) JPT_NA19012 JPT_NA19066 JPT_NA19077 JPT_NA18961 JPT_NA18977 JPT_NA19072 JPT_NA18944 JPT_NA19085 JPT_NA19057 JPT_NA18953 JPT_NA18981 JPT_NA18999 JPT_NA19082 JPT_NA19084 JPT_NA18980 JPT_NA19009 JPT_NA18987 JPT_NA19010 JPT_NA18978 JPT_NA18998 JPT_NA18939 JPT_NA18994 JPT_NA18995 JPT_NA18954 contributed to the genetics of the mod- JPT_NA18962 JPT_NA18957 JPT_NA18992 JPT_NA18966 Mexican (MXL) CHB_NA18626 Q2_DGMQ31455 Q1_DGMQ31444 Q2_DGMQ31514 Q2_DGMQ31529 Q0_DGMQ31598 Q2_DGMQ32368 Q1_DGMQ31623 Q2_DGMQ31601 Q2_DGMQ31574 Q2_DGMQ31226 Q1_DGMQ31607 Q2_DGMQ31582 C Q2_DGMQ31872 Q2_DGMQ32325 Q3_DGMQ32371 Q2_DGMQ31537 Q1_DGMQ32118 Q1_DGMQ31274 Persian-South Asian (Q2) Q1_DGMQ32308 Q1_DGMQ31131 Q1_DGMQ31725 Q1_DGMQ31705 Q1_DGMQ32064 Q1_DGMQ31449 Q1_DGMQ32224 Q1_DGMQ32287 Q1_DGMQ31433 Puerto Rican (PUR) ern peoples of the Arabian Peninsula. Q1_DGMQ31614 Q0_DGMQ31627 Q3_DGMQ31643 Q1_DGMQ31617 Q1_DGMQ31717 Q1_DGMQ31568 Q1_DGMQ32121 Q1_DGMQ31711 Q1_DGMQ31129 Q0_DGMQ31225 Q1_DGMQ31237 Q1_DGMQ32302 Q1_DGMQ31714 Q1_DGMQ31644 Q1_DGMQ31556 Q2_DGMQ31552 Q1_DGMQ32148 Q1_DGMQ32061 Q1_DGMQ31030 Q1_DGMQ31650 Q1_DGMQ31029 Q1_DGMQ31245 D Q1_DGMQ31230 Q1_DGMQ32182 Q1_DGMQ31254 Q1_DGMQ31639 Bedouin (Q1) Q1_DGMQ31123 Iberian Spanish (IBS) Q1_DGMQ32167 Q1_DGMQ31470 Q1_DGMQ31508 Q1_DGMQ31695 Q1_DGMQ31516 For understanding how human migra- Q1_DGMQ31005 Q1_DGMQ31606 Q1_DGMQ31645 Q1_DGMQ32293 Q1_DGMQ31125 Q1_DGMQ32215 Q1_DGMQ32250 Q2_DGMQ32977 Q1_DGMQ31609 Q1_DGMQ32180 Q1_DGMQ32273 Q1_DGMQ32219 Q2_DGMQ31497 Q0_DGMQ31613 Q0_DGMQ31441 Q2_DGMQ31452 Q0_DGMQ31483 Q1_DGMQ31724 Q2_DGMQ31108 Q0_DGMQ31615 Q1_DGMQ31625 Q3_DGMQ31641 Q2_DGMQ32145 Tuscan Italian (TSI) Q1_DGMQ32200 Q3_DGMQ32311 ASW_NA20278 Q3_DGMQ32044 Q3_DGMQ32060 Q0_DGMQ31179 Q3_DGMQ32059 Q3_DGMQ31406 ASW_NA19921 tion history has determined the structure Q3_DGMQ32285 Q3_DGMQ31425 Q3_DGMQ32220 Q3_DGMQ32160 PUR_HG01108 ASW_NA20342 ASW_NA20334 ASW_NA20336 ASW_NA19923 ASW_NA20351 ASW_NA19922 Q3_DGMQ31036 Q3_DGMQ31058 ASW_NA19914 ASW_NA19819 ASW_NA20126 ASW_NA19713 ASW_NA19985 ASW_NA19707 Persian-South Asian (Q2) ASW_NA19818 E ASW_NA20282 ASW_NA20359 ASW_NA20363 ASW_NA20296 ASW_NA20317 ASW_NA20412 ASW_NA20322 Q3_DGMQ32309 Q3_DGMQ31244 Q3_DGMQ31583 Q3_DGMQ32251 Q3_DGMQ32291 of modern genomes, our identification ASW_NA20289 ASW_NA20341 ASW_NA20332 Admixed African (ASW, Q3) ASW_NA20276 ASW_NA20340 ASW_NA20344 ASW_NA20346 ASW_NA19900 ASW_NA19700 ASW_NA19835 ASW_NA20287 ASW_NA20281 ASW_NA19834 ASW_NA19920 ASW_NA19701 Han Chinese South (CHS) ASW_NA19704 ASW_NA19984 ASW_NA20339 ASW_NA20348 ASW_NA20294 ASW_NA19703 ASW_NA19904 ASW_NA19711 ASW_NA20356 ASW_NA19982 ASW_NA19909 ASW_NA19712 ASW_NA19901 ASW_NA19916 ASW_NA19908 ASW_NA20298 of a cluster of Q1 (Bedouin) as the most ASW_NA20357 ASW_NA19917 ASW_NA20127 LWK_NA19437 LWK_NA19473 LWK_NA19474 LWK_NA19327 LWK_NA19315 LWK_NA19036 LWK_NA19401 LWK_NA19463 LWK_NA19035 Han Chinese Beijing (CHB) LWK_NA19308 LWK_NA19020 LWK_NA19028 LWK_NA19385 LWK_NA19041 LWK_NA19398 LWK_NA19449 LWK_NA19457 LWK_NA19443 LWK_NA19470 LWK_NA19469 LWK_NA19331 LWK_NA19334 LWK_NA19313 LWK_NA19360 LWK_NA19332 LWK_NA19428 LWK_NA19461 LWK_NA19439 LWK_NA19438 distant ancestors of non-Africans is of LWK_NA19310 LWK_NA19431 LWK_NA19324 LWK_NA19383 LWK_NA19394 LWK_NA19372 LWK_NA19462 LWK_NA19351 Japanese (JPT) LWK_NA19446 LWK_NA19038 LWK_NA19399 LWK_NA19429 LWK_NA19391 LWK_NA19404 LWK_NA19456 LWK_NA19377 LWK_NA19455 LWK_NA19319 LWK_NA19373 LWK_NA19374 LWK_NA19371 LWK_NA19436 LWK_NA19440 LWK_NA19390 LWK_NA19468 LWK_NA19346 LWK_NA19395 LWK_NA19309 LWK_NA19359 LWK_NA19384 LWK_NA19448 LWK_NA19451 considerable interest, particularly given LWK_NA19452 Luhuya (LWK) LWK_NA19430 LWK_NA19328 LWK_NA19381 Bedouin (Q1) LWK_NA19382 LWK_NA19380 LWK_NA19379 LWK_NA19044 LWK_NA19321 LWK_NA19338 LWK_NA19396 LWK_NA19397 LWK_NA19350 LWK_NA19046 LWK_NA19375 LWK_NA19393 LWK_NA19307 LWK_NA19312 LWK_NA19316 LWK_NA19445 LWK_NA19453 LWK_NA19434 LWK_NA19444 LWK_NA19355 LWK_NA19347 LWK_NA19352 LWK_NA19376 LWK_NA19317 LWK_NA19318 LWK_NA19471 LWK_NA19466 LWK_NA19467 African American (ASW) the suspected route of migration out of LWK_NA19435 LWK_NA19311 LWK_NA19472 LWK_NA19403 ASW_NA20291 Q3_DGMQ31489 YRI_NA19147 YRI_NA19198 YRI_NA18934 YRI_NA18933 YRI_NA19235 YRI_NA18909 YRI_NA19131 F YRI_NA18508 YRI_NA18871 YRI_NA18916 YRI_NA18516 YRI_NA18917 YRI_NA18908 YRI_NA19236 YRI_NA18853 YRI_NA18868 YRI_NA19138 YRI_NA18870 YRI_NA19248 YRI_NA18874 YRI_NA19114 YRI_NA18507 Subpopulation Unassigned (Q0) YRI_NA18520 YRI_NA18867 YRI_NA18873 YRI_NA19099 Africa and into the surrounding con- YRI_NA19130 YRI_NA19108 YRI_NA19098 YRI_NA19207 YRI_NA19137 YRI_NA19256 YRI_NA18522 YRI_NA19102 YRI_NA19119 YRI_NA19129 YRI_NA18912 YRI_NA19200 YRI_NA19172 YRI_NA18519 YRI_NA18907 YRI_NA18504 YRI_NA18505 YRI_NA18501 YRI_NA19209 YRI_NA18499 YRI_NA18523 YRI_NA19160 YRI_NA19093 YRI_NA19225 African (Q3) YRI_NA18510 YRI_NA19152 YRI_NA18511 YRI_NA19190 YRI_NA19197 YRI_NA18487 Yoruban (YRI) YRI_NA18498 YRI_NA19189 tinents. The possibility that the Q1 YRI_NA19146 YRI_NA19175 YRI_NA19095 YRI_NA19185 YRI_NA19096 YRI_NA19149 YRI_NA19121 YRI_NA19222 YRI_NA19150 YRI_NA18861 YRI_NA19117 YRI_NA18489 YRI_NA18856 YRI_NA18923 YRI_NA19257 YRI_NA19113 YRI_NA19213 YRI_NA18486 YRI_NA19107 Luhuya Kenyan (LWK) YRI_NA19116 YRI_NA19171 YRI_NA19204 YRI_NA18502 YRI_NA18858 YRI_NA18910 YRI_NA18924 YRI_NA19223 YRI_NA19247 YRI_NA18517 YRI_NA19118 (Bedouin) are descendants of the first Yoruba Nigerian (YRI) Eurasians provides an additional piece Figure 6. Neighbor-joining tree hierarchical clustering analysis of the combined Qatari genomes and of the puzzle concerning ancient migra- the 1000 Genomes Project Phase 1 samples based on pairwise proportion of shared alleles calculated tion routes and the establishment of an- across the entire autosome. (A) The entire neighbor-joining tree with each of the branches leading to in- dividuals in the 1000 Genomes samples color-coded by continent (Europeans in shades of purple: CEU, cient non-African populations. FIN, GBR, IBS, TSI; Asians in shades of brown: CHB, CHS, JPT; Africans in shades of orange: LWK, YRI, ASW; Americans in shades of green: CLM, MXL, PUR) and with the Q1 (Bedouin) in red, Q2 (Persian-South Asian) in azure, Q3 (African) in black, and Q0 (Subpopulation Unassigned) in gray. (B) Detail of the three Methods (15%) Q2 (Persian-South Asian) that cluster with Europeans. (C) Detail of the 11 (55%) Q2 (Persian- South Asian) individuals, with three (5%) Q1 (Bedouin), one (5%) Q3 (African), and one (13%) Q0 Ethics statement (Subpopulation Unassigned) that cluster as an outgroup to Asians. (D) Detail of the 50 (89%) Q1 indi- Human subjects were recruited, and writ- viduals, with three (15%) Q2 (Persian-South Asian), one (5%) Q3 (African), and two (25%) Q0 (Subpopulation Unassigned), that cluster outside the Africans and African Ancestry in Southwest US ten informed consent was obtained at and that also cluster as an outgroup to all other non-African populations, indicating that they are the Hamad Medical Corporation (HMC) most distant ancestors of all non-Africans. (E) Detail showing the three (15%) Q1 (Bedouin), three and HMC Primary Health Care Centers (15%) Q2 (Persian-South Asian), 12 (60%) Q3 (African), and four (50%) Q0 (Subpopulation Doha, Qatar, under protocols approved Unassigned) that do not form large clusters but are all located within the admixed cluster. (F) Detail of by the Institutional Review Boards of the one (5%) Q3 (African) that clusters between Yoruba (YRI) and Luhya (LWK). Hamad Medical Corporation and Weill Cornell Medical College in Qatar. generally, this suggests that for any genome-wide association study (GWAS) or rare variant association study (RVAS) of diabetes Inclusion criteria or other complex diseases in Qatar, inference of deep ancestry in Qatar is a peninsula nation on the eastern edge of the Arabian the Arabian Peninsula, using rare variation sampled by genome Peninsula (Supplemental Fig. 1). The population of Qatar includes or exome sequencing, is critical for identifying new disease risk more than 2 million inhabitants, comprised of ∼300,000 nationals genes. Given the dearth of next generation sequencing studies with roots in Qatar predating the discovery of oil and gas conducted in Middle Eastern and Arab populations, these results and establishment of an independent nation in 1970 and the indicate that a considerable number of variants that make impor- more than 1.7 million immigrants who mostly arrived in the past

158 Genome Research www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Uninterrupted ancestry in Arabia decade (Qatar Statistics Authority 2013, http://www.qsa.gov.qa/ 11,711,411 autosomal biallelic SNPs. The transition:transversion QatarCensus/Pdf/Population above 15 by educational attainment, ratio of this final set was 2.2, close to values previously observed nationality, age, sex and marital status.pdf). As selection criteria, in the 1000 Genomes Project (The 1000 Genomes Project we required that subjects be third-generation Qataris and all ances- Consortium 2012). Based on the concordance and quality mea- tors were Qatari citizens born in Qatar, as assessed by question- sures, the calls generated from our pipeline were considered to be naires. Recent immigrants or residents of Qatar who traced their high quality, and these were used for all further aspects of this recent ancestry to other geographic regions were excluded. study. After exclusion of four related Qataris (Supplemental Natives of the Arabian Peninsula can be divided into at least Table III), the final integrated call set included 11,711,386 autoso- three genetic subpopulations that reflect the historical migration mal biallelic SNPs in 1196 genomes. patterns in the region: Q1 (Bedouin), Q2 (Persian-South Asian), and Q3 (African) (Hunter-Zinck et al. 2010; Omberg et al. 2012; Integration with Human Origins data set Rodriguez-Flores et al. 2012). A panel of 48 SNPs was genotyped by TaqMan (Life Technologies) sufficient for classification into The 1000 Genomes Project Phase 1 is an excellent resource for rare one of the three subpopulations based on >70% ancestry in one variant discovery; however, it is limited in terms of the breadth of cluster in a STRUCTURE analysis with k = 3 used to identify indi- global populations sampled. Unfortunately, at the time of writing, viduals that could unambiguously be placed in one of these three no global resource of sequenced genomes existed; hence, the next groups (Supplemental Fig. 2; Pritchard et al. 2000; Rodriguez- best alternative for comparison of the Qataris to populations “ ” Flores et al. 2012). Our primary focus was the Q1 (Bedouin) genet- around the world is the Human Origins Fully Public Dataset (re- “ ” ic subpopulation because of its deepest ancestry in Arabia ferred to here as Human Origins [HO]), which includes genotype (Ferdinand et al. 1993), so we selected 60 Q1 (Bedouin) individu- data for 1917 indivduals from Africa, West Eurasia (including als to include in the sample. We additionally selected 20 Q2 Middle East), South Asia, East Asia, Central Asia/Siberia, and (Persian-South Asian) and 20 Q3 (African) to use as controls in America. In particular, the West Eurasian, African, and South the analysis, and an additional eight Q0 (Subpopulation Unas- Asian data sets include populations sampled in countries close to signed) individuals that could not be confidently placed in one Qatar, where detection of shared ancestry is of interest in this of these subpopulations, defined as not having >70% ancestry study. The data set also includes data from archaic genomes, in any of the three groups as determined by STRUCTURE. The such as Altai Neanderthal, Denisova, and chimpanzee, which are total sample therefore included 108 individuals with an even dis- of interest in this study for quantification of Neanderthal ancestry. tribution of males and females (see Supplemental Methods; The Human Origins data set includes a number of samples also Supplemental Table I). present in the 1000 Genomes Project (Supplemental Table IV), and for these samples, the Human Origins overlap data is kept. In order to conduct population genetic analysis on a com- Illumina deep sequencing of the genomes bined data set of the 104 Qatari genomes (QG, n = 104), the 1000 In order to characterize the spectrum of genetic variation, each of Genomes Project Phase 1 (1000G-HO, n = 1028 after exclusion the 108 Qatari genomes were sequenced to a median depth of 37× of duplicates), and Human Origins Fully Public Dataset (HO, n = (minimum 30×) through the Illumina Genome Network (see 1862 after exclusion of archaic genomes, ancient genomes, Supplemental Methods for details). and other genomes not relevant to this study) (Supplemental Table V), a set of sites overlapping between the integrated Qatari Relatedness among Qataris genomes plus the 1000 Genomes Project minus Human Origins, and the Human Origins data set were identified. Of 600,841 Given the high rate of consanguineous marriage previously re- SNPs in the Human Origins data set and 11,711,386 SNPs in the ported in the Qatari population (Hunter-Zinck et al. 2010; Mezza- Qatari genomes plus the 1000 Genomes Project data set, 388,805 villa et al. 2015), we sought to quantify the relatedness between SNPs overlapped. Further filtering was conducted on the data set, individuals in our sample and to exclude closely related individ- pruning SNPs based on linkage disequilibrium using PLINK uals that could potentially confound population genetics analysis (Purcell et al. 2007), “–indep-pairwise 200 25 0.4,” matching pa- methods that assume the input sample is unrelated. In order to con- rameters used previously (Lazaridis et al. 2014). After linkage dise- duct the relatedness analysis, autosomal SNPs in 108 Qatari ge- quilibrium-pruning, the final data set for analysis included nomes (described above) were filtered using PLINK 1.9 (Chang 197,714 SNPs segregating in the three data sets (QG, 1000G-HO, et al. 2015), and relatedness between the 108 Qatari genomes was and HO). assessed using kinship coefficients estimated by KING-robust (Manichaikul et al. 2010) and PREST-plus (McPeek and Sun 2000) (see Supplemental Methods). Both methods found the same five Inbreeding coefficient first-degree and second-degree relationships, in which these rela- In order to place the high reported consanguinity in Qatar in a tionships were then confirmed by investigative reassessment of global context, the inbreeding coefficient was calculated using medical records. One individual from each of the five pairs of rel- PLINK 1.9 (Chang et al. 2015) for Q1 (Bedouin), Q2 (Persian- atives was then excluded from the study. Three of the pairs of rel- South Asian), and Q3 (African) Qataris, the 1000 Genomes atives formed a trio; hence, two individuals were excluded from Project minus Human Origins overlap, and Human Origins popu- the trio, and one individual was excluded from each of the other lations (see Supplemental Methods). two pairs, resulting in exclusion of four relatives in total. Principal component analysis Integration with the 1000 Genomes Project Phase 1 A PCA (Price et al. 2006) wascarried out for the combined 104 Qatari An integrated SNP call set was produced for ancestry analysis for a genomes, the 1000 Genomes Project minus Human Origins over- total of 1200 genomes, combining the 108 Qatari genomes with lap, and Human Origins samples using the 197,714 SNPs in the the 1092 genomes from the 1000 Genomes Project Phase 1 integrated data set (filtering criteria described above). Using the (1000 Genomes) (The 1000 Genomes Project Consortium 2012) results of this large-scale analysis, visual assessment of clustering (see Supplemental Methods). The integrated call set included and population overlap was used to confirm expected relationships

Genome Research 159 www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Rodriguez-Flores et al. between the analyzed populations. Four distinct plots of a single the 96 Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) PCA run were constructed: one comparing the Qatari genomes Qatari genomes. A plot of effective population size versus years to the 1000 Genomes populations (Supplemental Fig. 5A), one in the past was generated for each of the genome using instruc- comparing Qataris to the 1000 Genomes and Human Origins tions from the PSMC manual (Li and Durbin 2011; see Samples including two visualizations of the full data set (Fig. 1A, Supplemental Methods). For comparison, the same PSMC pipeline color-coded by regional meta-populations; Supplemental Fig. 5B, was run on BAM files of Illumina deep sequencing reads mapped to color-coded by detailed population), and one comparing Qataris the GRCh37 human reference genome for an individual of to Middle Eastern populations from the Human Origins data European ancestry (NA12878, Utah resident with Northern and set (Fig. 1B). For the latter, in order to compare Qataris to Middle Western European ancestry, CEU) and an individual of African an- Eastern populations with potential for recent shared Bedouin an- cestry (NA19239, Yoruba in Ibadan, Nigeria, YRI) sequenced as cestry with Qataris sampled by the Human Origins data set, popu- part of the 1000 Genomes Pilot (The 1000 Genomes Project lations from the Middle East previously labeled in Lazaridis et al. Consortium 2010). The resulting PSMC plots for these two individ- (2014) as “West Eurasia,” were relabeled as “Middle East,” including uals were shifted slightly, such that they align with Qatari PSMC Bedouin A, Bedouin B, Druze, Egyptian Comas, Egyptian Metspalu, plots at distant (>200,000 yr ago) timescales (Fu et al. 2014). Iranian, Jordanian, Lebanese, Palestinian, Saudi, Syrian, Turkish, Turkish Adana, Turkish Aydin, Turkish Balikesir, Turkish Istanbul, Turkish Kayseri, Turkish Trabzon, and Yemen. Genome-wide admixture analysis In order to learn more about the ancestry of the sampled Qataris, a Y and mitochondria haplogroup assignment genome-wide admixture analysis was conducted on the combined data set of 104 Qatari genomes, the 1000 Genomes Project minus In order to determine the prevalence of known Chr Y and mtDNA Human Origins overlap, and Human Origins using ADMIXTURE haplogroups in Qatar, SNP genotypes were generated simultaneously (Supplemental Methods; Alexander et al. 2009). The cross-valida- for the 108 Qatari genomes using an updated version of GATK tion error was calculated for a range of expected number of ances- (v3.1.1) (DePristo et al. 2011) that supports haploid chromosome tral populations (K), and the K with the lowest cross-validation calling (n =53ChrY,n = 108 mtDNA). For one of the genomes, the error was used to quantify ancestry, in this case K = 12. sample was originally thought to be male but is most likely female due to low call rates on Chr Y. This sample was excluded from Chr Y analysis and X/A diversity analysis, but was included in autoso- African admixture proportion and timing mal and mtDNA analysis. Mean coverage of mapped reads was 11× in Chr Y and 3892× in mtDNA. After exclusion of related and Q0 (Sub- In order to estimate the proportion and timing of African admix- population Unassigned) (admixed) Qataris, the remaining samples ture in Qatari populations, the genomes of Qataris and world pop- included 47 Chr Y and 96 mtDNA. ulations were analyzed using ALDER 1.2 (Supplemental Methods; Haplogroup assignments for the Chr Y and mtDNA were Loh et al. 2013). made using previously characterized variants. For Chr Y, these as- signments were made using YFitter (Jostins et al. 2014) by using Local admixture analysis variants limited to known SNPs cataloged by the International Society of Genetic Genealogy (Jobling and Tyler-Smith 2003) with- An admixture deconvolution analysis was performed on the 96 Q1 in a 10-Mb interval of the Y Chromosome that is known to be ame- (Bedouin), Q2 (Persian-South Asian), or Q3 (African) Qatari ge- nable to analysis based on short read sequencing (Skaletsky et al. nomes using the 11,711,386 autosomal SNPs segregating in both 2003; Poznik et al. 2013). For mtDNA, these assignments were the 1000 Genomes Project and Qatari genomes using SupportMix made using HaploGrep (Kloss-Brandstätter et al. 2011) by using (Supplemental Fig. 9; Supplemental Methods; Omberg et al. 2012). the set of known haplogroup-specific variants in the PhyloTree (van Oven and Kayser 2009) database. In order to quantify the differences between mtDNA and Chr Y Neanderthal ancestry in terms of diversity of the haplogroups identified, the propor- In order to compare the proportion of Neanderthal admixture tion of variance among and within populations was quantified in Q1 (Bedouin) Qataris with that of other populations in the for Chr Y and mtDNA using the AMOVA function in Arlequin 1000 Genomes Project (The 1000 Genomes Project Consortium (Supplemental Methods; Excoffier et al. 1992; Excoffier and Lischer 2012) and Human Origins (Lazaridis et al. 2014), the F4 ratio 2010). The analysis was repeated eight times, including separate (Patterson et al. 2012) and Patterson’s D-statistic (Patterson et al. analysis of Chr Y and mtDNA, for three-way comparison of the pop- 2012) were estimated using the qpF4ratio and qpDstat programs, ulations, as well as all possible two-way comparisons (Q1/Q2, Q1/ respectively, from the ADMIXTOOLS 3.0 package (Supplemental Q3, Q2/Q3). The proportion of variance among and within popula- Methods; Patterson et al. 2012). tions was tabulated, as well as the estimated F and P-value for both. st We additionally considered the expected F4 ratio for the Q1 (Bedouin) under the scenario of no admixture between Comparison of X Chromosome to autosomal diversity Neanderthal and direct ancestors of Q1 (Bedouin), such that ob- served Neanderthal ancestry in Q1 (Bedouin) would be entirely The ratio of X-linked to autosomal nucleotide diversity (X/A) for due to European admixture. From the estimated components of different populations was computed following the approach in the ADMIXTURE analysis with K = 12, the Southern European an- Gottipati et al. (2011) and Arbiza et al. (2014) (Supplemental cestry in the Q1 (Bedouin) is 8.2% on average, and the Northern Methods). European ancestry in Q1 (Bedouin) is 1.3% on average, totaling 9.5% of the genome. If the Q1 (Bedouin) had never mixed with Neanderthal prior to introduction of European admixture, assum- Coalescent analysis ing no selection against introgressed genomic intervals, we would

To infer the extent and timing of bottlenecks, the pairwise sequen- therefore expect an F4 ratio in Q1 (Bedouin) to be on the order of tial Markov coalescent (PSMC) (Li and Durbin 2011) was applied to 1/10 of those observed in European populations.

160 Genome Research www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Uninterrupted ancestry in Arabia

TreeMix analysis and novel genomic SNPs have been submitted to NCBI dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) under submitter batch ID We performed a TreeMix analysis (Pickrell and Pritchard 2012) of QG108_GENOMIC_SNPS_20151008 (http://www.ncbi.nlm.nih. the 96 Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) gov/SNP/snp_viewBatch.cgi?sbid=1062298) and submitter handle Qatari genomes and the 1000 Genomes Project excluding admixed WEILL_CORNELL_DGM. PLINK and VCF files of genotypes for populations (Puerto Rican, Mexican, Colombian, and African variants analyzed in this study, both before and after integration Ancestry in Southwest US) (Supplemental Methods). with 1000 Genomes and Human Origins, are available on our web- site http://geneticmedicine.weill.cornell.edu/genome.html. Neighbor-joining tree clustering In order to determine if any of the Qatari genomes were the most Acknowledgments distant ancestors of all non-African populations, neighbor-joining trees were constructed for the 104 Qatari genomes and the 1000 We thank the three reviewers of this manuscript for helpful com- Genomes Project using the 11,711,386 autosomal SNPs segregat- ments and suggestions; the 1000 Genomes Project Consortium for ing in both data sets. For each pair of genomes, the proportion helpful advice on analysis methods; M.R. Staudt, Y. Strulovici- of shared alleles (PSA) (Mountain and Cavalli-Sforza 1997), or 1 Barel, A. Al Shakaki, O.M. Chidiac, R. Mathew, and the WCMC- minus the proportion of the genome identical by state (IBS), was Q Genomics Core for help with the study; and Mezey laboratory calculated using the “–distance -square -1-ibs” function in PLINK students and N. Mohamed for help in preparing this manuscript. 1.9 (Purcell et al. 2007; Chang et al. 2015), which outputs a These studies were supported, in part, by the Qatar Foundation 1196×1196 matrix of distances (1 minus IBS distance or PSA). A and Weill Cornell Medical College in Qatar. J.L.R.F. was supported, neighbor-joining (NJ) tree was constructed using a recently updat- in part, by the National Heart, Lung, and Blood Institute of the ed version of the original NJ (Saitou and Nei 1987) algorithm called National Institutes of Health (NIH) T32 HL09428. NJS (Criscuolo and Gascuel 2008) that is better at handling missing values, as implemented in the APE package in R (Paradis et al. 2004; R Core Team 2014). Overall, this approach is computation- References ally tractable for millions of markers genotyped in thousands of ge- The 1000 Genomes Project Consortium. 2010. A map of human genome nomes and produces similar topologies to maximum-likelihood variation from population-scale sequencing. Nature 467: 1061–1073. clustering methods but requires only a fraction of the compute The 1000 Genomes Project Consortium. 2012. An integrated map of genetic time, where the trade-off is a sacrifice in the accuracy of branch variation from 1,092 human genomes. Nature 491: 56–65. lengths (Tateno et al. 1994). The algorithm takes the distance Abu-Amero KK, González AM, Larruga JM, Bosley TM, Cabrera VM. 2007. Eurasian and African mitochondrial DNA influences in the Saudi matrix as input and outputs a tree. In order to confirm the robust- Arabian population. BMC Evol Biol 7: 32. ness to sample ordering, the order of samples in the matrix was Abu-Amero KK, Larruga JM, Cabrera VM, González AM. 2008. Mitochondri- shuffled and reclustered 100 times, in which all reclusterings re- al DNA structure in the Arabian Peninsula. BMC Evol Biol 8: 45. covered the same tree. In order to produce bootstrap support val- Abu-Amero KK, Hellani A, González AM, Larruga JM, Cabrera VM, Underhill PA. 2009. Saudi Arabian Y-Chromosome diversity and its re- ues for the tree, 100 reclusterings of the tree were generated lationship with nearby regions. BMC Genet 10: 59. based on random sampling of SNPs. For each bootstrap iteration, Alexander DH, Novembre J, Lange K. 2009. Fast model-based estimation of 11,711,386 random (with replacement) SNPs were selected using ancestry in unrelated individuals. Genome Res 19: 1655–1664. a Python script (www.python.org), and then the PSA distance Alsmadi O, Thareja G, Alkayal F, Rajagopalan R, John SE, Hebbar P, Behbehani K, Thanaraj TA. 2013. Genetic substructure of Kuwaiti pop- matrix and NJ tree were recalculated using these SNPs. Bootstrap ulation reveals migration history. PLoS One 8: e74913. support was calculated using the Python package SumTrees Alsmadi O, John SE, Thareja G, Hebbar P, Antony D, Behbehani K, Thanaraj (Sukumaran and Holder 2010). TA. 2014. Genome at juncture of early human migration: a systematic For visualization, the tree was rooted at the most recent com- analysis of two whole genomes and thirteen exomes from Kuwaiti pop- ulation subgroup of inferred Saudi Arabian tribe ancestry. PLoS One 9: mon ancestor (MRCA) node of the largest cluster of the 1000 e99069. Genomes Yoruba (YRI) genomes in the tree. A color version of Arbiza L, Gottipati S, Siepel A, Keinan A. 2014. Contrasting X-linked and au- the tree was produced using TreeGraph 2 (Stöver and Müller tosomal diversity across 14 human populations. Am J Hum Genet 94: – 2010) by manually coloring the branches leading to each node. 827 844. Armitage SJ, Jasim SA, Marks AE, Parker AG, Usik VI, Uerpmann HP. 2011. A single color is assigned to each population, with populations The southern route “out of Africa”: evidence for an early expansion of from the same continent having similar colors: Europeans in modern humans into Arabia. Science 331: 453–456. shades of purple, Asians in shades of brown, Americans in shades Behar DM, Yunusbayev B, Metspalu M, Metspalu E, Rosset S, Parik J, Rootsi of green, Africans in shades of orange, Q1 (Bedouin) in red, Q2 S, Chaubey G, Kutuev I, Yudkovsky G, et al. 2010. The genome-wide structure of the Jewish people. Nature 466: 238–242. (Persian-South Asian) in blue, Q3 (Sub-Saharan African) in black, Cann RL, Stoneking M, Wilson AC. 1987. Mitochondrial DNA and human and Q0 (Subpopulation Unassigned) in gray. When a cluster of evolution. Nature 325: 31–36. nodes includes different populations, the terminal branches were Cavalli-Sforza LL, Feldman MW. 2003. The application of molecular genetic 33( ): given population-specific colors, whereas the shared higher-order approaches to the study of human evolution. Nat Genet Suppl 266–275. branches for the cluster were given the color of the population in Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. 2015. majority. For example, if 10 Q1 (Bedouin) and 1 Q0 (Subpopula- Second-generation PLINK: rising to the challenge of larger and richer tion Unassigned) were in a cluster, the branches above where the datasets. Gigascience 4: 7. nodes come together were colored red. Criscuolo A, Gascuel O. 2008. Fast NJ-like algorithms to deal with incom- plete distance matrices. BMC Bioinformatics 9: 166. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A frame- Data access work for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491–498. The sequence data generated for this study in BAM format, as well Esposito JL. 2001. Women in muslim family law, 2nd ed. Syracuse Universtiy as SNP genotypes in VCF format, have been submitted to the NCBI Press, Syracuse, NY. Excoffier L, Lischer HE. 2010. Arlequin suite ver 3.5: a new series of pro- Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) grams to perform population genetics analyses under Linux and under accession number SRP060765. Allele frequencies for known Windows. Mol Ecol Resour 10: 564–567.

Genome Research 161 www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Rodriguez-Flores et al.

Excoffier L, Smouse PE, Quattro JM. 1992. Analysis of molecular variance in- Patterson N, Richter DJ, Gnerre S, Lander ES, Reich D. 2006. Genetic evi- ferred from metric distances among DNA haplotypes: application to hu- dence for complex speciation of humans and chimpanzees. Nature man mitochondrial DNA restriction data. Genetics 131: 479–491. 441: 1103–1108. Ferdinand K, Nicolaisen I, Carlsberg Foundation’s Nomad Research Project. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck 1993. Bedouins of Qatar. Thames and Hudson, New York. T, Webster T, Reich D. 2012. Ancient admixture in human history. Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL, Genetics 192: 1065–1093. Aximu-Petri A, Prüfer K, de FC, et al. 2014. Genome sequence of a Pérez-Miranda AM, Alfonso-Sánchez MA, Peña JA, Herrera RJ. 2006. Qatari 45,000-year-old modern human from western Siberia. Nature 514: DNA variation at a crossroad of human migrations. Hum Hered 61: 445–449. 67–79. Gottipati S, Arbiza L, Siepel A, Clark AG, Keinan A. 2011. Analyses of X- Pickrell JK, Pritchard JK. 2012. Inference of population splits and mixtures 8: linked and autosomal genetic variation in population-scale whole ge- from genome-wide allele frequency data. PLoS Genet e1002967. nome sequencing. Nat Genet 43: 741–743. Pickrell JK, Patterson N, Loh PR, Lipson M, Berger B, Stoneking M, Gronau I, Hubisz MJ, Gulko B, Danko CG, Siepel A. 2011. Bayesian inference Pakendorf B, Reich D. 2014. Ancient west Eurasian ancestry in southern 111: – of ancient human demography from individual genome sequences. Nat and eastern Africa. Proc Natl Acad Sci 2632 2637. Genet 43: 1031–1034. Poznik GD, Henn BM, Yee MC, Sliwerska E, Euskirchen GM, Lin AA, Snyder Hammer MF, Karafet T, Rasanayagam A, Wood ET, Altheide TK, Jenkins T, M, Quintana-Murci L, Kidd JM, Underhill PA, et al. 2013. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of Griffiths RC, Templeton AR, Zegura SL. 1998. Out of Africa and back 341: – again: nested cladistic analysis of human Y chromosome variation. males versus females. Science 562 565. Mol Biol Evol 15: 427–441. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Henn BM, Cavalli-Sforza LL, Feldman MW. 2012. The great human expan- 2006. Principal components analysis corrects for stratification in ge- nome-wide association studies. Nat Genet 38: 904–909. sion. Proc Natl Acad Sci 109: 17758–17764. Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population struc- Hershkovitz I, Marder O, Ayalon A, Bar-Matthews M, Yasur G, Boaretto E, ture using multilocus genotype data. Genetics 155: 945–959. Caracuta V, Alex B, Frumkin A, Goder-Goldberger M, et al. 2015. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Levantine cranium from Manot Cave (Israel) foreshadows the first Renaud G, Sudmant PH, de Filippo C, et al. 2014. The complete genome European modern humans. Nature 520: 216–219. sequence of a Neanderthal from the Altai Mountains. Nature 505: Hochreiter S. 2013. HapFABIA: identification of very short segments of 43–49. identity by descent characterized by rare variants in large sequencing 41: Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, data. Nucleic Acids Res e202. Sklar P, de Bakker PI, Daly MJ, et al. 2007. PLINK: a tool set for whole-ge- Hodgson JA, Mulligan CJ, Al-Meeri A, Raaum RL. 2014. Early back-to-Africa Am J Hum 10: nome association and population-based linkage analyses. migration into the Horn of Africa. PLoS Genet e1004393. Genet 81: 559–575. Hunter-Zinck H, Musharoff S, Salit J, Al-Ali KA, Chouchane L, Gohar A, R Core Team. 2014. R: a language and environment for statistical computing. Matthews R, Butler MW, Fuller J, Hackett NR, et al. 2010. Population ge- R Foundation for Statistical Computing, Vienna, Austria. http://www. 87: – netic structure of the people of Qatar. Am J Hum Genet 17 25. R-project.org/. Jobling MA, Tyler-Smith C. 2003. The human Y chromosome: an evolution- Rasmussen M, Guo X, Wang Y, Lohmueller KE, Rasmussen S, Albrechtsen A, 4: – ary marker comes of age. Nat Rev Genet 598 612. Skotte L, Lindgreen S, Metspalu M, Jombart T, et al. 2011. An Aboriginal John SE, Thareja G, Hebbar P, Behbehani K, Thanaraj TA, Alsmadi O. 2015. Australian genome reveals separate human dispersals into Asia. Science Kuwaiti population subgroup of nomadic Bedouin ancestry—whole ge- 334: 94–98. nome sequence and analysis. Genom Data 3: 116–127. Rodriguez-Flores JL, Fuller J, Hackett NR, Salit J, Malek JA, Al-Dous E, Jostins L, Xu Y, McCarthy S, Ayub Q, Durbin R, Barrett J, Tyler-Smith C. Chouchane L, Zirie M, Jayoussi A, Mahmoud MA, et al. 2012. Exome se- 2014. YFitter: maximum likelihood assignment of Y chromosome hap- quencing of only seven Qataris identifies potentially deleterious vari- logroups from low-coverage sequence data. arXiv: 1407.7988. ants in the Qatari population. PLoS One 7: e47614. Kloss-Brandstätter A, Pacher D, Schönherr S, Weissensteiner H, Binna R, Rodriguez-Flores JL, Fakhro K, Hackett NR, Salit J, Fuller J, Agosto-Perez F, Specht G, Kronenberg F. 2011. HaploGrep: a fast and reliable algorithm Gharbiah M, Malek JA, Zirie M, Jayyousi A, et al. 2014. Exome sequenc- for automatic classification of mitochondrial DNA haplogroups. Hum ing identifies potential risk variants for Mendelian disorders at high Mutat 32: 25–32. prevalence in Qatar. Hum Mutat 35: 105–116. Kopelman NM, Stone L, Gascuel O, Rosenberg NA. 2013. The behavior of Rowold DJ, Luis JR, Terreros MC, Herrera RJ. 2007. Mitochondrial DNA admixed populations in neighbor-joining inference of population trees. geneflow indicates preferred usage of the Levant Corridor over the Pac Symp Biocomput 273–284. Horn of Africa passageway. J Hum Genet 52: 436–447. Lazaridis I, Patterson N, Mittnik A, Renaud G, Mallick S, Kirsanow K, Saitou N, Nei M. 1987. The neighbor-joining method: a new method for re- Sudmant PH, Schraiber JG, Castellano S, Lipson M, et al. 2014. constructing phylogenetic trees. Mol Biol Evol 4: 406–425. Ancient human genomes suggest three ancestral populations for Sandridge AL, Takeddin J, Al-Kaabi E, Frances Y. 2010. Consanguinity in present-day Europeans. Nature 513: 409–413. Qatar: knowledge, attitude and practice in a population born between Li H, Durbin R. 2011. Inference of human population history from individ- 1946 and 1991. J Biosoc Sci 42: 59–82. ual whole-genome sequences. Nature 475: 493–496. Schiffels S, Durbin R. 2014. Inferring human population size and separation Loh PR, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, Berger B. history from multiple genome sequences. Nat Genet 46: 919–925. 2013. Inferring admixture histories of human populations using linkage Shi H, Zhong H, Peng Y, Dong YL, Qi XB, Zhang F, Liu LF, Tan SJ, Ma RZ, disequilibrium. Genetics 193: 1233–1254. Xiao CJ, et al. 2008. Y chromosome evidence of earliest modern human Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. 2010. settlement in East Asia and multiple origins of Tibetan and Japanese 6: Robust relationship inference in genome-wide association studies. populations. BMC Biol 45. Bioinformatics 26: 2867–2873. Shriner D, Tekola-Ayele F, Adeyemo A, Rotimi CN. 2014. Genome-wide ge- Markus B, Alshafee I, Birk OS. 2014. Deciphering the fine-structure of tribal notype and sequence-based reconstruction of the 140,000 year history 4: admixture in the Bedouin population using genomic data. Heredity of modern human ancestry. Sci Rep 6055. (Edinb.) 112: 182–189. Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown Mathieson I, McVean G. 2014. Demography and the age of rare variants. LG, Repping S, Pyntikova T, Ali J, Bieri T, et al. 2003. The male-specific PLoS Genet 10: e1004528. region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423: 825–837. McPeek MS, Sun L. 2000. Statistical tests for detection of misspecified rela- Stöver BC, Müller KF. 2010. TreeGraph 2: combining and visualizing evi- tionships by use of genome-screen data. Am J Hum Genet 66: 1076–1094. dence from different phylogenetic analyses. BMC Bioinformatics 11: 7. Mezzavilla M, Vozzi D, Badii R, Alkowari MK, Abdulhadi K, Girotto G, Sukumaran J, Holder MT. 2010. DendroPy: a Python library for phylogenet- Gasparini P. 2015. Increased rate of deleterious variants in long runs ic computing. Bioinformatics 26: 1569–1571. of homozygosity of an inbred population from Qatar. Hum Hered 79: Tateno Y, Takezaki N, Nei M. 1994. Relative efficiencies of the maximum- 14–19. likelihood, neighbor-joining, and maximum-parsimony methods Mountain JL, Cavalli-Sforza LL. 1997. Multilocus genotypes, a tree of indi- 11: – 61: – when substitution rate varies with site. Mol Biol Evol 261 277. viduals, and human evolutionary history. Am J Hum Genet 705 718. van Oven M, Kayser M. 2009. Updated comprehensive phylogenetic tree of Omberg L, Salit J, Hackett N, Fuller J, Matthew R, Chouchane L, Rodriguez- global human mitochondrial DNA variation. Hum Mutat 30: E386– Flores JL, Bustamante C, Crystal RG, Mezey JG. 2012. Inferring genome- E394. wide patterns of admixture in Qataris using fifty-five ancestral popula- tions. BMC Genet 13: 49. Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289–290. Received March 13, 2015; accepted in revised form December 15, 2015.

162 Genome Research www.genome.org Downloaded from genome.cshlp.org on March 16, 2016 - Published by Cold Spring Harbor Laboratory Press

Indigenous Arabs are descendants of the earliest split from ancient Eurasian populations

Juan L. Rodriguez-Flores, Khalid Fakhro, Francisco Agosto-Perez, et al.

Genome Res. 2016 26: 151-162 originally published online January 4, 2016 Access the most recent version at doi:10.1101/gr.191478.115

Supplemental http://genome.cshlp.org/content/suppl/2015/12/21/gr.191478.115.DC1.html Material

References This article cites 65 articles, 17 of which can be accessed free at: http://genome.cshlp.org/content/26/2/151.full.html#ref-list-1

Open Access Freely available online through the Genome Research Open Access option.

Creative This article, published in Genome Research, is available under a Creative Commons Commons License (Attribution 4.0 International), as described at License http://creativecommons.org/licenses/by/4.0/.

Email Alerting Receive free email alerts when new articles cite this article - sign up in the box at the Service top right corner of the article or click here.

To subscribe to Genome Research go to: http://genome.cshlp.org/subscriptions

© 2016 Rodriguez-Flores et al.; Published by Cold Spring Harbor Laboratory Press