Human Migration Population Studies Using Multiple Data Types

Deutsch 1 Ashley Deutsch

Dr. Ely

BIOL 303: Genetics

November 1, 2014

The human species has a worldwide distribution, yet the most recent common human ancestor lived in Africa 100,000 to 200,000 years ago (Pakendorf and Stoneking 2005). Because of this recent origin, human populations are difficult to study: few polymorphisms useful for study have accumulated (Wirth et al. 2004). The global distribution of the human species has been shaped by demographic events, such as migration and variation of population size over time (Via et al. 2011). Understanding DNA variation among human populations is important for medicine, developmental biology, and reconstructing the history of H. sapiens (Cavalli-Sforza and Feldman 2003). Haploid markers, namely mitochondrial DNA and the Y-chromosome, are often useful in human population studies (Cavalli-Sforza and Feldman 2003); however, in inferring population history, the incorporation of multiple data types is essential (Via et al. 2011). This incorporation can be done by supplementing data from one haploid marker with another haploid marker or with nuclear DNA variation (Pakendorf and Stoneking 2005), or by combining social and geographical information, including historical and archaeological data, with the genetic information (Via et al. 2011). Less traditionally, yet with growing implementation (Devi et al. 2007), language and cultural traits, along with commensals and parasites that have co-evolved with humans and that modern humans carried with them during migration, can be used to support genetic data in migration studies (Cavalli-Sforza and Feldman 2003). According to Cavalli-Sforza and Feldman (2003), the use of multidisciplinary approaches in these ways has been essential to advances in the understanding of human evolutionary history.

A recent study by Secher et al. (2014) utilized a multidisciplinary approach to population genetics. In this study, Secher et al. looked at mitochondrial (mt) DNA of 230 individuals from
Africa, Europe, and the Middle East, who belonged to the U6 haplogroup, in order to construct a phylogeny. mtDNA is inherited only maternally, without recombination, and has a high mutation rate, making it a good marker for tracing lineages (Pakendorf and Stoneking 2005). This study looked at mutations within the mtDNA highly variable region 1 (HVR1) in order to track dispersion and create the phylogeny. The U6 haplogroup is a group of similar mtDNA sequences with a common ancestor, found mainly in individuals from North Africa. The 230 samples of mtDNA were sequenced and the differences between individual sequences were analyzed. Similar studies had previously been conducted, but with smaller sample sizes and less complete sequencing, which makes this study’s results more precise. The study used the mtDNA mutation accumulation rate of one mutation every 3,624 years, previously published by Soares et al. (2009), to estimate the coalescence age of each haplotype and subhaplotype within U6. Because any given mutation arises rarely, it can be assumed that all individuals with the same mutation share a common ancestor from which they descended, and that haplotypes separated by a greater number of mutations are more distantly related than those separated by few. Using these data along with the calculated coalescence ages, Secher et al. constructed a phylogenetic tree with branching ages for all of the 230 sequences in the study.

The most recent common ancestor of the U6 line was calculated, using the mutation rate, to have lived approximately 35.3 kya (thousand years ago). This conclusion is similar to the results of previous studies. This result was then analyzed with respect to climate, as is important in population studies. The common ancestor lived during the Early Upper Paleolithic era, before the glacial maximum, yet during a time that was cold and dry enough to force individuals to follow a North African coastal route.
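The dating step described above can be illustrated with a minimal sketch. It assumes the commonly used rho statistic (the mean number of mutations separating sampled lineages from the root of their clade) multiplied by the clock rate of one mutation per 3,624 years cited from Soares et al. (2009). The text does not state which estimator Secher et al. applied, so the function name and the mutation counts below are hypothetical.

```python
from statistics import mean

# Clock rate quoted in the text (Soares et al. 2009): one mutation every 3,624 years.
YEARS_PER_MUTATION = 3624


def coalescence_age(mutations_from_root, years_per_mutation=YEARS_PER_MUTATION):
    """Estimate a clade's age in years as rho times the clock rate.

    rho is the mean number of mutations between each sampled lineage and the
    root haplotype of the clade (an assumed estimator, see the lead-in).
    """
    if not mutations_from_root:
        raise ValueError("need at least one lineage")
    rho = mean(mutations_from_root)
    return rho * years_per_mutation


# Hypothetical counts of mutations separating five lineages from the clade root.
example_counts = [9, 10, 11, 9, 10]
age_years = coalescence_age(example_counts)
print(f"rho = {mean(example_counts):.2f}, estimated age = {age_years / 1000:.1f} kya")
```

Read the other way, an age of 35.3 kya at this rate corresponds to a rho of a little under ten mutations, which gives a sense of how few mutational steps separate the sampled sequences from their common ancestor.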
The data were then used to construct a phylogeography, the combined genetic and geographic distribution of a species. Using only the HVR1 sequences, the authors mapped the distribution of U6 and its sub-groups by analyzing their frequencies throughout Africa, Europe, and the Middle East (figure 1). This figure shows that the total U6a haplogroup (totU6a) has areas of high frequency in northeast and southwest Africa. The map of U6a without the 16189 transition mutation (U6a) has a high frequency only in northeast Africa, suggesting this region as the origin of this haplogroup. The authors concluded that more data are needed to evaluate the most probable origin of the U6 haplogroup. A calculation was then performed to estimate the origin of the U6 haplogroup outside of Africa using anthropological, climate, and geographic data. The authors assumed that the North African coastal route was 5,000 km long. To travel this distance, individuals carrying the U6 haplotype would have had to migrate at a rate of 11.2 km/year, which is reasonable for Paleolithic hunter-gatherers. Given the mutation rate, there is likely a 7,000-year gap between the formation of the U macro-haplogroup and the U6 haplogroup. The mutation rate and migration rate together indicate that the origin of the U macro-haplogroup was about 4,000 km outside of Africa, in Eurasia. This predicted migration route, which took migration rate and climate into account, agreed with archaeological data.

The migration routes of later branches of U6 sub-haplogroups were also proposed by combining genetic data with climatic and archaeological data. One instance in which the authors combined climate data with haplogroup frequencies was the analysis of the U6a2 branch. This branch showed a radiation centered in Ethiopia 20 kya. That period was one of maximal aridity in North Africa, making it unlikely that the migration back to East Africa occurred across the Sahara desert. The authors instead proposed a gradual migration of small groups over a long period as the mechanism for this migration. Another example was the use of archaeological data to support the radiation in Morocco around 26 kya. It had previously been suggested that this migration was associated with the Aterian, but the authors found that it more likely correlates with the Iberomaurusian archaeological data.

The authors tested the reliability of this multidisciplinary approach by applying their method to migration patterns of the post-colonial era, which are known through historical records, and found that it explained their data successfully. The analysis compared the phylogeny based on the uniparental marker, mtDNA, with archaeological, geographic, climate, and anthropological information. The authors state that this method of complete genome sequencing combined with complex statistical analysis will serve as a model for future studies of population genetics.

Trivedi et al. (2008) also conducted a study with haploid markers and analyzed their results with respect to other data; however, their research combined Y-chromosome data with linguistic, sociocultural, and geographic data to make predictions about human migration in India. The results were then compared to mtDNA results from previous studies. Y-chromosomes are a good indicator of human population relationships because they are restricted to the male germ line and undergo limited recombination during meiosis, yet accumulate a relatively high number of mutations, which creates variability (Hughes and Rozen 2012). Trivedi et al. collected blood samples from 1,152 unrelated males from 80 populations varying in linguistic family (Indo-European, Austro-Asiatic, Dravidian, and Tibeto-Burman), socio-ethnic association, and geographic area within India in order to analyze their Y-chromosomes. Additionally, 282 Indian samples from the Punjab, Konkanastha Brahmin, Koya, Yerava, Mullukunan, Kuruchian, and Koraga populations and 3,047 samples from 76 populations outside of India were taken from the literature for an analysis of distance and origin. The Y-chromosomes were analyzed for 38 previously described binary polymorphisms. From these data, haplotypes of short tandem repeat sequences on the Y-chromosome (Y-STRs) and single nucleotide polymorphisms on the Y-chromosome (Y-SNPs) were constructed. Genetic differences were analyzed among socio-ethnic, linguistic, and geographic groups at three levels: between individuals of a population, between populations of a group, and between groups of populations. The authors estimated the time of the most recent common ancestor by analyzing seven Y-STR loci (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, and DYS393) with a generation time of 25 years and a mutation rate of 6.9 × 10^-4, as described in Zhivotovsky et al. (2004).
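A rough sketch of how a Y-STR age estimate of this kind can be computed is given here. It assumes an average-squared-distance (ASD) estimator, in which the mean squared difference in repeat number between each chromosome and a founder haplotype is divided by the mutation rate to give an age in generations. It also assumes that the quoted rate of 6.9 × 10^-4 applies per locus per generation and that the founder is the per-locus median. The function name and repeat counts are hypothetical, and the text does not confirm that Trivedi et al. computed their estimates exactly this way.

```python
from statistics import mean, median

MUTATION_RATE = 6.9e-4   # rate quoted in the text; assumed per locus per generation
GENERATION_YEARS = 25    # generation time quoted in the text


def tmrca_years(haplotypes, mutation_rate=MUTATION_RATE,
                generation_years=GENERATION_YEARS):
    """Estimate time to the most recent common ancestor from Y-STR repeat counts.

    Each haplotype is a list of repeat counts, one per locus, in a fixed order.
    """
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n_loci = len(haplotypes[0])
    # Assumption: the founder carried the median repeat count at each locus.
    founder = [median([h[i] for h in haplotypes]) for i in range(n_loci)]
    # Average squared distance to the founder, computed separately per locus.
    asd_per_locus = [
        mean([(h[i] - founder[i]) ** 2 for h in haplotypes])
        for i in range(n_loci)
    ]
    generations = mean(asd_per_locus) / mutation_rate
    return generations * generation_years


# Hypothetical repeat counts at seven loci for five chromosomes.
sample = [
    [14, 13, 29, 24, 10, 11, 13],
    [15, 13, 29, 24, 10, 11, 13],
    [14, 13, 30, 24, 10, 11, 13],
    [14, 13, 29, 25, 11, 11, 13],
    [14, 13, 29, 24, 10, 11, 13],
]
print(f"Estimated TMRCA: {tmrca_years(sample):.0f} years")
```

Because the estimate scales inversely with the mutation rate, any uncertainty in that rate carries over directly into the inferred ages.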

After sequencing, 24 paternal lineages were observed within India. This haplogroup diversity is higher than that of Europe or East Asia and more similar to that of Central Asia. Austro-Asiatic and Tibeto-Burman tribes showed lower diversity than the total sample. The linguistic comparison showed that Dravidian populations had a higher diversity than Indo-European-speaking populations. Additionally, geographic analysis showed that groups in South India had higher haplogroup diversity than did the groups in North India. These data were compiled into a phylogeographic distribution with haplogroup frequencies in figure 2. The amount of variation among geographical regions was determined to be greater than that among populations within the regions, indicating that geographic distribution was most significant; however, geographic distribution also tended to mirror linguistic family distribution. The socio-ethnic influence could not be seen from a geographic distribution, and there was little difference between the genetic distribution of castes and that of tribes. Coalescence ages were calculated for the haplogroups R1a1, H, C, O2a, and R2, with estimated most recent common ancestors at around 32 kya, 44 kya, 49 kya, 36 kya, and 40 kya, respectively. The archaeological evidence from this period is very sparse, so human migration predictions were made through other methods.
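The diversity comparisons above can be made concrete with a short sketch. It assumes Nei's unbiased gene diversity, h = n/(n - 1) × (1 - Σ p_i²), a standard measure for haplogroup frequencies; the text does not say which diversity index Trivedi et al. used, and the function name and sample assignments below are hypothetical.

```python
from collections import Counter


def haplogroup_diversity(assignments):
    """Return Nei's unbiased gene diversity for a list of haplogroup labels."""
    n = len(assignments)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(assignments)
    # Sum of squared haplogroup frequencies in the sample.
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical samples: one with several haplogroups, one dominated by a single lineage.
mixed_population = ["H", "H", "R1a1", "R2", "O2a", "H", "R1a1", "R2"]
uniform_population = ["O2a"] * 7 + ["H"]

print(f"mixed:   {haplogroup_diversity(mixed_population):.3f}")
print(f"uniform: {haplogroup_diversity(uniform_population):.3f}")
```

A population dominated by one haplogroup, as reported for the Austro-Asiatic and Tibeto-Burman tribes, yields a value near zero, whereas a population carrying many lineages at similar frequencies yields a value near one.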

A phylogenetic tree constructed for the samples (figure 3) showed that the haplotype distributions were closely linked with linguistic families. The Austro-Asiatic and Tibeto-Burman speakers were clustered around the O2a haplotype, separate from the Indo-European and Dravidian clusters. The results of this comparison of Y-chromosome haplogroups, including those of Southeast Asia, indicated a connection between Austro-Asiatic and Tibeto-Burman speakers and the Southeast Asian population. The other populations demonstrated a connection with Indo-European speakers in Central Asia and Eastern Europe.

Using the collected and analyzed data, the authors made predictions about evolutionary events, such as founder effect, gene flow, and genetic drift, as well as factors, including geographic, linguistic and cultural barriers, which produced the Indian patrilineal distribution they observed.

The large diversity of Y-haplogroups in India suggested an early settlement. The four major haplogroups identified in this study were H, R1a1, O2a, and R2. The H haplogroup was found to be mainly confined to India. By viewing the data alongside the accompanying mtDNA haplogroups, a strategy that provides a more complete picture, the authors suggested that this haplogroup was associated with an eastward migration during the late Pleistocene through the Levantine corridor. The H-M69 group shows a fairly uniform distribution across different populations, which indicated an appearance early in the lineage cluster. The R1a1 haplogroup demonstrated high STR variance, indicating a period of population growth and expansion. These data were consistent with an early migration from Central or South Asia. Analysis of the R2 haplogroup with respect to geographic distribution indicated an Indian origin for this haplogroup as well as several small migrations out of India. Its uneven distribution within India, as depicted in figure 2, was attributed by the authors to genetic drift or a bottleneck rather than migration. In these data, the lack of C haplogroup sub-lineages indicated to the authors that most Indian populations originated within the subcontinent, which they took as evidence against the theory of Aryan migration into India from Central Asia.

Although socio-ethnic factors, mainly the caste system, are a large part of Indian society, no significant variation was found between caste groups and tribes. This result was mirrored in the previously published mtDNA results to which the sample was compared. This evidence again supports the hypothesis that Indian populations are likely derived from common settlers during the Pleistocene. The limited gene flow from Europe and Central and West Asia is further indicated by agricultural evidence, namely the lack of Neolithic farming markers. The authors proposed that agriculture arose in India through the earliest migration of Dravidian speakers and then again later through the migration of rice cultivators from Southeast Asia.

The linguistic analysis of the Austro-Asiatic and Tibeto-Burman language families provided interesting data. The low haplotype diversity in these populations indicated a demographic event that reduced diversity. The authors suggested a common founder event followed by a bottleneck. The linguistic branches within Tibeto-Burman populations were evident in the haplogroup diversity, while those within the Austro-Asiatic populations were not. The authors hypothesized that this result is due to the occurrence of multiple migration events of Tibeto-Burman speakers and only a single, earlier Austro-Asiatic migration event. Geographic distribution supports this hypothesis. mtDNA from a previous study by Metspalu et al. (2004) indicated that Tibeto-Burman speakers contributed many maternal lineages, while a study by Thangaraj et al. (2005) indicated an absence of markers in Austro-Asiatic tribes. From this information and their own data, the authors predicted that either the migration from Southeast Asia was male dominant or the mtDNA has been completely lost. This conclusion was supported by agricultural expansion data.

Trivedi et al. (2008) proposed that settlers of South India, rather than the previously assumed Austro-Asiatic tribes, were the original settlers of the subcontinent. This prediction was supported through the geographical and linguistic comparison in this study. The multidisciplinary approach in this study, including analysis of data with respect to agricultural histories, linguistic distributions, geographic information, and a comparison to mtDNA from the same region, allowed more conclusions to be drawn and a more complete migration picture to be formed than analysis of Y-chromosome data alone would have allowed.

A study by Devi et al. (2007) took a very different approach to studying human migration from the analysis of haploid markers done by Trivedi et al. (2008) and Secher et al. (2014). The study analyzed variation in Helicobacter pylori, a highly variable bacterium that has colonized human stomachs, has co-evolved with humans, and is transmitted primarily vertically (Dominguez-Bello and Blaser 2011). Vertical transmission, a transmission directly from mother to child, makes this bacterium useful in tracking lineages (Dominguez-Bello and Blaser 2011) in a more specific way than mtDNA (Wirth et al. 2004). This study analyzed its data with respect to geography, culture, religion, and linguistics, as well as mtDNA published in other studies, to produce conclusions about waves of human migration within India. A total of 63 H. pylori samples were collected from native Indian people primarily of Aryan and Dravidian ancestry. For each sample, a 600 base pair region from each of seven housekeeping genes (atpA, efp, ureI, ppa, mutY, trpC, and yphC) was sequenced, and seven haplotypes were identified. An additional 600 sequences from other databases were used for phylogenetic tree construction.

Almost all of the strains in the study were found to be most similar to those from the European H. pylori subpopulation (hpEurope). This similarity indicates a phylogenetic relationship between the H. pylori populations of Europe and India and, by extension, the people. The data from the 400 previously published sequences of H. pylori from geographically and ethnically diverse hosts and the newly sequenced H. pylori were used to create a geographic distribution of global population structure in figure 4 (left). The phylogeny was constructed based on 650 mutation positions and demonstrates that H. pylori spread out of Africa mirroring the spread of humans, consistent with the co-evolution of these species. The branches were spread for greater clarity to the right of the figure.

This result revealed a clear geographic distribution of sub-populations and populations. All of the isolates from North and South India and two from Ladakh were clustered under the hpEurope population (green). Seventeen of the Ladakhi sequences were clustered on one hpAsia2 branch (gray). Further distinction within the hpEurope population was made by analyzing an additional 650 mutation sites in order to create a more specific phylogeny in the center box of the figure.

H. pylori samples were also sequenced at the cag pathogenicity island (cagPAI) locus. This locus is a region responsible for translocation into the host gastric epithelial cells (Terry et al. 2005). This analysis showed that the cagPAI sequences from India within this sample (red) were part of the European cluster (figure 5). From the use of geographic data in combination with genetic frequencies, in both figures 4 and 5, the authors hypothesized that H. pylori was likely introduced to India by Indo-European people, which is consistent with the idea of gene flow into India from Indo-Aryans. They suggested that this event occurred at the same time as the Indo-European languages arrived, between 4,000 and 10,000 years ago. In a comparison with the mtDNA results of Kivisild et al. (1999), the data do not rule out the alternative possibility that the common origin of Indian and European strains occurred much earlier, during the Upper Paleolithic migration of humans in Eurasia. Given other indications of earlier migrations into India from other places, the authors hypothesized that H. pylori was already present in the Indian population before the Indo-Aryan migration, but that strains carrying the European cagPAI outcompeted those from other locations. The results showed a homogeneous population makeup regardless of religion and language, so these factors could not support additional conclusions. H. pylori does not have a known mutation rate and, therefore, could not be used to estimate the dates of migration events. This lack of date information prevents comparison with historical and archaeological data. Thus, the conclusions from this study relied heavily on the comparison of genetic data with geographic distribution data.

The studies conducted by Secher et al. (2014), Trivedi et al. (2008), and Devi et al. (2007) demonstrate the usefulness of comparing genetic data with other forms of data in order to draw a greater number of valuable conclusions in studies of human migration. The increased effectiveness of a multidisciplinary approach, however, extends beyond studies of human migration to other kinds of population studies (Cavalli-Sforza and Feldman 2003). Equally, the methods are not confined to those in the aforementioned studies. The use of co-evolved species to infer migration patterns and phylogeny can apply to many species, including Mycobacterium tuberculosis and hepatitis viruses (Dominguez-Bello and Blaser 2011). Also, in addition to uniparental markers, variation in nuclear DNA can be used to construct phylogeny (Pakendorf and Stoneking 2005). These methods, in conjunction with geographical or cultural data, can provide support for the genetic data. These multidisciplinary methods as a whole provide context to genetic information that allows for the advancement of knowledge in the field of human migration (Cavalli-Sforza and Feldman 2003).

Literature Cited:

Cavalli-Sforza LL, Feldman MW (2003) The application of molecular genetic approaches to the study of human evolution. Nat Genet 33:266-275
Devi SM, Ahmed I, Francalacci P, Hussain MA, Akhter Y, Alvi A, Sechi LA, Mégraud F, Ahmed N (2007) Ancestral European roots of Helicobacter pylori in India. BMC Genomics 8:184 doi: 10.1186/1471-2164-8-184
Dominguez-Bello MG, Blaser MJ (2011) The human microbiota as a marker for migration of individuals and populations. Annu Rev Anthro 40:451-474 doi: 10.1146/annurev-anthro-081309-145711
Hughes JF, Rozen S (2012) Genomics and genetics of human and primate Y chromosomes. Annu Rev Genom 13:83-108 doi: 10.1146/annurev-genom-090711-163855
Kivisild T, Bamshad MJ, Kaldma K, Metspalu M, Metspalu E, Reidla M, Laos S, Parik J, Watkins WS, Dixon ME, Papiha SS, Mastana SS, Mir MR, Ferak V, Villems R (1999) Deep common ancestry of Indian and western-Eurasian mitochondrial DNA lineages. Curr Biol 9:1331-1334
Metspalu M, Kivisild T, Metspalu E, Parik J, Hudjashov G, Kaldma K, Serk P, Karmin M, Behar DM, Gilbert MT, Endicott P, Mastana S, Papiha SS, Skorecki K, Torroni A, Villems R (2004) Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans. BMC Genet 5:26
Pakendorf B, Stoneking M (2005) Mitochondrial DNA and human evolution. Annu Rev Genom 6:165-183 doi: 10.1146/annurev.genom.6.080604.162249
Secher B, Fregel R, Larruga JM, Cabrera VM, Endicott P, Pestano JJ, González AM (2014) The history of the North African mitochondrial DNA haplogroup U6 gene flow into the African, Eurasian, and American continents. BMC Evol Biol 14:109 doi: 10.1186/1471-2148-14-109
Soares P, Ermini L, Thomson N, Mormina M, Rito T, Rohl A, Salas A, Oppenheimer S, Macaulay V, Richards MB (2009) Correcting for purifying selection: an improved human mitochondrial molecular clock. Am J Hum Genet 84:740-759
Terry CE, McGinnis LM, Madigan KC, Cao P, Cover TL, Liechti GW, Peek RM, Forsyth MH (2005) Genomic comparison of cag Pathogenicity Island (PAI)-positive and -negative Helicobacter pylori strains: Identification of novel markers for cag PAI-positive strains. Infect Immun 73:3794-3798 doi: 10.1128/IAI.73.6.3794-3798.2005
Thangaraj K, Sridhar V, Kivisild T, Reddy AG, Chaubey G, Singh VK, Kaur S, Agarawal P, Rai A, Gupta J, Mallick CB, Kumar N, Velavan TP, Suganthan R, Udaykumar D, Kumar R, Mishra R, Khan A, Annapurna C, Singh L (2005) Different population histories of the Mundari- and Mon-Khmer-speaking Austro-Asiatic tribes inferred from the mtDNA 9-bp deletion/insertion polymorphism in Indian populations. Hum Genet 116:507-517
Trivedi R, Sanghamitra S, Amika S, Bindu GH, Banerjee J, Tandon M, Gaikwad S, Rajkumar R, Sitalaximi T, Richa, Chainy GBN, Kashyap VK (2008) Genetic imprints of Pleistocene origin of Indian populations: A comprehensive phylogeographic sketch of Indian Y-chromosomes. Int J Hum Genet 8:97-118
Via M, Gignoux CR, Roth LA, Fejerman L, Galanter J, Choudhry S, Toro-Labrador G, Viera-Vera J, Oleksyk TK, Beckman K, Ziv E, Risch N, Burchard EG, Martinez-Cruzado JC (2011) History shaped the geographic distribution of genomic admixture on the island of Puerto Rico. PLOS One 6:e16513 doi: 10.1371/journal.pone.0016513
Wirth T, Wang XY, Linz B, Novick RP, Lum JK, Blaser M, Morelli G, Falush D, Achtman M (2004) Distinguishing human ethnic groups by means of sequences from Helicobacter pylori: Lessons from Ladakh. PNAS 101:4746-4751 doi: 10.1073/pnas.0306629101
Zhivotovsky LA, Underhill PA, Cinnioglu C, Kayser M, Morar B, Kivisild T, Scozzari R, Cruciani F, Destro-Bisol G, Spedini G, Chambers G, Herrera RJ, Yong KK, Gresham D, Tournev I, Feldman MW, Kalaydjieva L (2004) The effective mutation rate at Y chromosome short tandem repeats, with application to human population-divergence time. Am J Hum Genet 74:50-61