bioRxiv preprint doi: https://doi.org/10.1101/089466; this version posted November 25, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Tracing the biogeographical origin of South Asian populations using DNA SatNav

Ranajit Das1* and Priyanka Upadhyai2

Affiliations:
1. Manipal Centre for Natural Sciences (MCNS), Manipal University, Manipal, Karnataka, India
2. Department of Medical Genetics, Kasturba Medical College, Manipal University, Manipal, Karnataka, India

*Address for correspondence:
Ranajit Das
Manipal Academy of Higher Education (MAHE)
Manipal University
University building, Lab 12
Madhav Nagar
Manipal-576104
Karnataka, India
Phone: +918582802871
Email: [email protected]

Both authors contributed equally to this work

Abstract

The Indian subcontinent includes India, Bangladesh, Pakistan, Nepal, Bhutan, and Sri Lanka, which collectively share common anthropological and cultural roots. Given the enigmatic population structure, complex history and genetic heterogeneity of populations from this region, their biogeographical origin and history remain a fascinating question. In this study we carried out an in-depth genetic comparison of the five South Asian populations available in the 1000 Genomes Project, namely Gujarati Indians from Houston, Texas (GIH), Punjabis from Lahore (PJL), Indian Telugus from UK (ITU), Sri Lankan Tamils from UK (STU) and Bengalis from Bangladesh (BEB), tracing their putative biogeographical origin using a DNA SatNav algorithm - Geographical Population Structure (GPS). GPS positioned >70% of GIH and PJL genomes in North India and >80% of ITU and STU samples in South India.
All South Asian genomes appeared to be assigned with reasonable accuracy along trade routes that thrived in the ancient Mauryan Empire. The empire played a significant role in unifying the Indian subcontinent and, in the process, brought the ancient North and South Indian populations into close proximity, promoting admixture between them ~2300 years before present (YBP). Our findings suggest that the genetic admixture between ancient North and South Indian populations likely first occurred along the Godavari and Krishna river basin in Central-South India. Finally, our biogeographical analyses provide critical insights into the population history and sociocultural forces driving migration patterns that may have been instrumental in shaping the population structure of the Indian subcontinent.

Introduction

India is the seventh largest country in the world in terms of land area; it lies in the Indian subcontinent, which also includes its neighboring countries, Bangladesh, Pakistan, Nepal, Bhutan and Sri Lanka, which collectively share common anthropological and cultural roots. India alone is the second most populous country in the world with more than 4000 anthropologically distinct groups, 22 languages and several other regional dialects.1,2 The Indian subcontinent is unique in its complex history, diversity and eclectic population structure; thus, the geographic origins of populations from this region remain a challenging question in genetics, cultural anthropology and history. The complex tapestry of genetic and cultural diversity observed in contemporary India is a result of admixture between its autochthonous and immigrant gene-pools.
Several lines of evidence have suggested that by 50,000 to 20,000 years before present (YBP) modern humans had migrated to various parts of India and its surrounding areas in South Asia.3,4 Linguistic and genetic evidence suggests that this was followed by the early migration of putative proto-Dravidian language speakers, ~6000 YBP from West Asia (Elam province), eventually giving rise to the Dravidian speaking, South Indian populations.5-7 The ancient peopling of India was followed by several more recent waves of immigration from West and East Eurasia (Central Asia, the Middle East, Europe and the Caucasus) by populations that spoke Indo-European languages and who are often presumed to be ‘Aryans’.8-11 The migration of Eurasian populations into the Indian subcontinent has often been ascribed to purported Aryan invasions and/or the spread of agriculture.12-15 The later Eurasian immigrants diffused throughout the Indian subcontinent and are believed to have admixed with the existing Indian populations, primarily contributing to the North Indian gene-pool.7,16 Several lines of evidence based on analyses of Y chromosome, mitochondrial DNA9,17-19 and whole-genome approaches10,20-22 have indicated that the modern populations of India are descendants of two genetically divergent ancestral populations, the Indo-European language speaking Ancestral North Indians (ANI) and the Dravidian speaking Ancestral South Indians (ASI).
Genome-wide analyses of individuals from well-defined ethno-linguistic groups of South Asia estimated that the admixture between ancestral North and South Indian populations likely occurred ~1900 to 4200 YBP.23 Despite the high admixture between ancestral Indian populations, the ensuing sociocultural stratification resulted in more than 400 tribal and 4000 caste groups in present-day India that are endogamous.24 Strictly enforced sociocultural norms barred mate-exchange between individuals belonging to distinct strata of society and resulted in consanguinity being as high as 20-30% in some Indian populations, leading to the possibility of founder mutations and a high prevalence of recessive inherited diseases in these groups.22,25 Thus, population stratification and constrained gene-flow between distinct groups altered the demography and genetic structure of populations from the Indian subcontinent.

Several studies have used high-resolution genome-wide approaches to assess the genetic heterogeneity and diversity of South Asian populations. The Indian Genome Variation Consortium (2008) analyzed 405 single nucleotide polymorphisms (SNPs) from 75 genes in 55 endogamous Indian populations and observed high levels of genetic divergence between them based on ethnicity and language.2 Reich et al. (2009) investigated the complex genetic canvas of India using a much larger sample size,
wherein 560,123 SNPs were surveyed in 132 Indians originating from 25 ethno-linguistically and socially distinctive groups across the country.22 Another study compared 85 Gujarati, or North Indian, genomes from Phase 3 of the International HapMap project to 83 South Asian Indians from Singapore to determine recent positive selection on the SLC24A5 gene that plays a role in skin pigmentation, suggesting a higher degree of genetic closeness between Gujaratis and Europeans.26

In the present study we have carried out an in-depth genetic comparison of the five South Asian populations available in the 1000 Genomes Project, namely Gujarati Indians from Houston, Texas, USA (GIH), Punjabis from Lahore, Pakistan (PJL), Indian Telugu from UK (ITU), Sri Lankan Tamil from UK (STU), and Bengalis from Bangladesh (BEB), assessing a total of 489 (103 GIH, 96 PJL, 102 ITU, 102 STU and 86 BEB) samples.27 We also investigated several previously reported Indian populations from distinct linguistic groups (N = 378), obtained from the Reich lab at Harvard Medical School, USA.23 Overall we evaluated 494,863 previously assessed SNPs23 to interrogate the genetic diversity among the above-mentioned five South Asian populations. Given the esoteric population history and profound genetic diversity in the Indian subcontinent, questions regarding their origin remain particularly fascinating. To facilitate our understanding of the biogeographic origin of South Asian populations, we employed an admixture-based method, known as Geographical Population Structure (GPS)28, which has been successfully applied to trace the origin of various modern populations including Yiddish language speakers29 and the Mountain Druze population30.

Materials and Methods

Datasets

The dataset used in the present study was obtained from the 1000 Genomes Project phase 3 and from the Reich Lab, Department of Genetics, Harvard Medical School, USA.21,23 The original data obtained from the Reich lab was in the EIGENSTRAT format and was converted into the PLINK bed format using EIG v4.2.31
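
The format conversion described above can be sketched as follows. This is a minimal illustration and not the authors' actual script: it assumes the conversion was done with the `convertf` program that ships with the EIG (EIGENSOFT) package, and the file prefixes `reich_india` and `reich_india_plink` are hypothetical placeholders. The Python helper only builds the text of a `convertf` parameter file; it does not run the conversion itself.

```python
def build_convertf_par(in_prefix: str, out_prefix: str) -> str:
    """Build the text of a convertf parameter file.

    Input is assumed to be EIGENSTRAT (.geno/.snp/.ind); output is
    PLINK binary (.bed/.bim/.fam), which convertf calls PACKEDPED.
    Both prefixes are hypothetical and chosen by the caller.
    """
    lines = [
        f"genotypename:    {in_prefix}.geno",   # EIGENSTRAT genotype file
        f"snpname:         {in_prefix}.snp",    # EIGENSTRAT SNP file
        f"indivname:       {in_prefix}.ind",    # EIGENSTRAT individual file
        "outputformat:    PACKEDPED",           # PLINK bed/bim/fam output
        f"genotypeoutname: {out_prefix}.bed",
        f"snpoutname:      {out_prefix}.bim",
        f"indivoutname:    {out_prefix}.fam",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical prefixes, used only for illustration.
par_text = build_convertf_par("reich_india", "reich_india_plink")
print(par_text)
```

Saving the printed text to a file (for example `par.eig2bed`, again a placeholder name) and running `convertf -p par.eig2bed` would then perform the conversion, provided EIGENSOFT is installed and the three input files exist.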