DOTTORATO DI RICERCA in Scienze Della Terra, Della Vita E Dell'ambiente
Total Page:16
File Type:pdf, Size:1020Kb
Alma Mater Studiorum – Università di Bologna DOTTORATO DI RICERCA IN Scienze della terra, della vita e dell’ambiente Ciclo XXX Settore Concorsuale: 05/B1 - Zoologia e Antropologia Settore Scientifico Disciplinare: BIO/08 - Antropologia Unraveling the combined effects of demography and natural selection in shaping the genomic background of Southern Himalayan populations Presentata da Guido Alberto Gnecchi Ruscone Coordinatore Dottorato Supervisore Prof. Giulio Viola Dott. Marco Sazzini Esame finale anno 2018 Table of contents Abstract 1 1. Introduction 3 1.1 Molecular Anthropology and the study of the genetic bases of human biodiversity 3 1.1.1 Studies based on uniparental genetic markers 3 1.1.2 Studies based on autosomal genome-wide markers 5 1.2 The Southern route hypothesis and the first peopling of the Asian continent 7 1.3 The genetic history of South Asian populations 8 1.4 The peopling of East Asia: southern and northern routes 11 1.5 Historical migrations in East Asia 13 1.6 The spread of Tibeto-Burman populations 15 1.7 The peopling of the Tibetan Plateau and its adaptive implications 19 1.7.1 Archeological evidence for the peopling of the Tibetan Plateau 19 1.7.2 Genetic evidence for the peopling of the Tibetan Plateau 21 1.7.3 Adaptive implications related to the peopling of the Tibetan Plateau 22 1.8 The Nepalese Tibeto-Burman populations 23 1.9 High altitude adaptation: the Himalayan case study 24 2. Aim of the study 29 3. Materials and Methods 31 3.1 Sampling campaigns and populations studied 31 3.2 Molecular analyses 35 3.2.1 DNA extraction, Sanger sequencing and high-throughput genotyping 35 3.2.2 Whole genome library preparation and massive parallel sequencing 35 3.3 Bioinformatic processing of generated raw data 36 3.3.1 Data curation on uniparental and genome-wide genotype data 36 3.3.2 Whole genome pair-end reads alignment and variant calling 40 3.3.3 Data curation on whole genome sequence data 42 3.4 Population genomics analyses 43 3.4.1 Genotype-based population structure analyses 43 3.4.2 Tests aimed at providing demographic inferences 43 3.4.3 Haplotype-based estimates of ancestry proportions and magnitude of drift 44 3.4.4 Local ancestry inference 45 3.4.5 Population structure analyses based on whole genome sequence data 46 3.4.6 Fine structure clustering analysis based on whole genome sequence data 46 3.5 Detection of genomic signatures of positive selection 47 3.5.1 Selection scans on SNP-chip genotype data 47 3.5.2 Selection scans on whole genome sequence data 50 3.5.2.1 nSL computation 50 3.5.2.2 DIND computation 52 3.5.3 Gene network analyses 52 4. Results 55 4.1 Population structure analyses and demographic inferences 55 4.1.1 Setting GCA populations into the South/East Asian genomic landscape 55 4.1.2 Complex admixture patterns of GCA and Tibeto-Burman populations 58 4.1.3 Disentangling the impact of admixture and drift on the history of Sherpa people 62 4.1.4 Genomic relationships between GCA and South/East Asian populations 69 4.1.5 The admixed phylogeny of Tibeto-Burman populations 72 4.1.6 Isolating the East Asian ancestry of Tibeto-Burman populations 74 4.1.7 Assessing representativeness of Sherpa and Tibetan whole genome sequence data 77 4.1.8 Identifying homogeneous Tibetan/Sherpa genetic clusters 81 4.2 Dissecting the role of positive selection in shaping the Himalayan genomes 83 4.2.1 Genome-wide signatures of positive selection in GCA Tibeto-Burmans 83 4.2.2 Fine-mapping of functional pathways involved in high-altitude adaptation 85 5. Discussion 93 5.1 Dissecting the GCA genomic landscape 94 5.2 Unraveling the genetic legacy of Tibeto-Burman populations 96 5.3 Divergent adaptive evolution of Sherpa and Tamang populations 98 5.4 Fine-mapping of functional pathways involved in high-altitude adaptation 99 5.5 Conclusions 105 Acknowledgments 107 References 109 Appendix 123 Abstract The populations inhabiting the high-altitude Himalayan valleys and the Tibetan plateau represent an exceptional case of human adaptation to a challenging environment having evolved multifaceted physiological adjustments that allow them to cope with hypobaric hypoxia. Recently, several studies shed new light into the ancestry of populations inhabiting the northern regions of the Tibetan plateau, partially elucidating also the genetic bases of their high-altitude adaptation. Nevertheless, the polygenic nature of such an adaptive phenotype, while being previously hypothesized, has not been formally tested so far. Moreover, less attention has been devoted to the study of populations from the Southern slopes of the Himalayas and to the history of migrations, admixture and/or isolation of the many non-Tibetan trans-Himalayan Tibeto-Burman speaking populations. In the present study, we examined genome-wide variation of previously unsurveyed Tibeto-Burman (i.e. Sherpa and Tamangs) and Indo-Aryan communities from remote Nepalese valleys along with literature data for many South/East Asian populations. Our analyses showed that most of Southern Himalayan Tibeto-Burmans derived their East Asian ancestry not from the Tibetan/Sherpa lineage, but from low-altitude ancestors who plausibly migrated across Northeast India/Myanmar, having experienced extensive admixture that reshuffled the ancestral Tibeto-Burman gene pool. These demographic inferences were also confirmed by the absence in Tamangs of the classical Tibetan/Sherpa “hard sweep” signatures at the high-altitude associated EPAS1 and EGLN1 genes. Finally, by generating a 20X whole genome sequence of a Sherpa individual and by merging it with published whole genome sequence data from Sherpa and Tibetan subjects, we applied an innovative gene network-based pipeline for the detection of signatures of positive selection. This approach enabled us to identify possible signals of polygenic adaptation occurred at the level of gene subnetworks belonging to functional pathways involved in controlling angiogenesis, thus expanding the knowledge about the genetic determinants underlying the complex Tibetan/Sherpa adaptive phenotype. 1 2 1. Introduction 1.1 Molecular Anthropology and the study of human biodiversity Molecular Anthropology, as the name suggests, is a discipline that conjugates anthropological approaches to the theoretical and methodological frameworks proper of Molecular Biology and, especially, Population Genetics/Genomics to study the natural history and evolution of modern humans (Homo sapiens) and their relative extinct species (e.g. Homo neanderthalensis) (reviewed in Destro-Bisol et al. 2010). In the 1960s, Cavalli-Sforza and colleagues were among the first scholars who proposed that human evolution could be studied by depicting the biological diversity of extant human populations, thus planting the seeds for this research field (reviewed in Destro-Bisol et al. 2010). Accordingly, the firsts Molecular Anthropology studies were based on the analysis of the so-called “classical markers” (i.e. proteins), such as the ABO, Rh, and Duffy antigens that determine blood groups, as well as the iso-enzymes of the erythrocytes (reviewed in Cavalli-Sforza et al. 1994). 1.1.1 Studies based on uniparental genetic markers The development of modern, DNA-based, Molecular Anthropology approaches was then made possible by the introduction of Polymerase Chain Reaction (PCR) in the 1980s, which indeed revolutionized the whole field of Molecular Biology. In fact, direct analysis of DNA sequences on one side allowed to increase the number of independent genetic markers that could be studied simultaneously, on the other allowed to discriminate between neutral evolving (e.g. mainly non-coding non-regulatory DNA) and potentially non-neutral evolving loci (e.g. coding or regulatory DNA). The latter category of genomic regions was essential to distinguish for the first time patterns of genetic variability caused by random allele frequency changes (i.e. genetic drift) due to specific population demographic events from those affected by the action of natural selection. Accordingly, hereafter as “genetic marker” we will adopt the definition reported by Jobling et al. 2014, which defines it as a “region of the genome characterized by known patterns of variability and that presents two or more allelic variants whose frequencies are different in diverse human populations”. In detail, the first modern Molecular Anthropology studies focused on the direct analysis of DNA sequences were based only on a limited fraction of the human genome: the uniparentally-inherited 3 genetic systems (i.e. the non-recombining region of the Y-chromosome, NYR and the mitochondrial DNA, mtDNA). Markers located on these genomic regions were used to reconstruct their phylogeny and to trace the routes of migrations and admixture that characterized the dispersal of human populations all over the world. Although they represent only a small portion of the whole human genome, these sequences have the great advantage of not being subjected to recombination during meiosis. This allows researchers to reconstruct accurate phylogenies of genetic lineages that are inherited intact from the father to sons (Y-chromosome) or from the mother to the entire offspring (mtDNA), also identifying the occasional variations caused by mutations, which represent one of the main sources of genetic diversity (Fig. 1.1). Figure 1.1. Mechanism of inheritance of autosomal and uniparental genetic markers (Butler 2005). Studies based on uniparental markers were essential to reconstruct human evolution and to gain insights into many aspect of human history. One of the most outstanding discoveries that have been made by studying uniparental markers is that Homo Sapiens is actually a recent species in terms of evolutionary times, who originated ~200 thousand years ago in East Africa, and that since its spread out of Africa gene flow between neighboring groups or even more extensive admixture events between distant populations have often occurred throughout its history. For these reasons, the overall intra-specific genetic diversity of our species is relatively low when compared to that reported for other mammals.