Adaptive History of the Chimpanzee Subspecies in the Genomic Era Jessica Nye TESI DOCTORAL UPF 2018 Directors de la Tesi Jaume Bertranpetit i Hafid Laayouni CIÈNCIES EXPERIMENTALS I DE LA SALUT ii Acknowledgements I would like to thank my advisors Dr. Jaume Bertranpetit and Dr. Hafid Laayouni for guiding me during my doctoral research. I would like to thank my family and friends for supporting me during my many years of education. To my fellow lab mates, I am grateful for the theoretical discussions we had over the last four years. This work was supported by FI Agaur grant FI-DGR 2015. iii iv Abstract Chimpanzees are our closest living genetic cousins. The species consists of four subspecies, each with a unique demographic history. In order to better understand their species, a detailed map of signatures of selection will highlight important genetic differences between the four subspecies. Here, we interrogate the genome using 15 tests of selection. We simulate neutral and selective scenarios taking their unique demographic history into account in the model. We combine all these elements using a machine learning approach to highlight regions of interest in each subspecies. We specifically investigate the regions of the genome that have been selected after introgression with bonobo. We investigate the haplotype changes that have occurred since the divergence with humans in a few specially selected genes. From this, we find that the evolutionary history of each of the four subspecies is unique. That subtle differences in their demographic history and environment have greatly shaped their genetic diversity. In general, we observe signatures in selection in phenotypes involving immunity, muscle function, reproduction, and DNA repair. This dissertation highlights the important genes and regions that have resulted in the divergence between subspecies of chimpanzee. v vi Resum Els ximpanzés són els nostres cosins genètics més propers. L'espècie consta de quatre subespècies, cadascuna amb una història demogràfica única. Per comprendre millor la diversitat dels ximpanzés hem elaborat un mapa detallat d’empremtes de selecció presents en el genoma que ressaltarà importants diferències evolutives entre les quatre subespècies. En aquest treball, interroguem el genoma amb 15 tests de selecció diferents i simulem escenaris neutres i selectius tenint en compte la història demogràfica única de cada subespècie. A través d’un procés de machine learning, hem combinat els resultats de tots aquests tests utilitzant un enfocament d'aprenentatge automàtic per reconèixer regions d'interès en cada subespècie, que són les que mostren clarament els rastres de la selecció positiva. Hem investigat específicament les regions del genoma que han estat seleccionades després de la introgressió amb el bonobo, donat que la selecció després de la introgressió és una àrea nova molt interessant. Pel conjunt del genoma, hem investigat els canvis que s'han produït des de la divergència amb els humans en gens especialment sotmesos a selecció positiva, obtenint un panorama de general adaptatiu en cada subespècie. A partir d'aquí, trobem que la història evolutiva de cadascuna de les quatre subespècies és única. Les subtils diferències entre subespècies en la seva història demogràfica i el seu entorn han donat forma a la seva diversitat genètica adaptativa. En general, observem empremtes de selecció en gens que informen fenotips que inclouen immunitat, funció muscular, reproducció i reparació d'ADN. Aquesta tesi destaca els gens i regions importants que han donat lloc a la divergència adaptativa entre subespècies de ximpanzé. vii viii Preface Humans have had an interest in the great apes for many decades. However, despite decades long observational studies, their genetic diversity is only just beginning to be understood. Due to the lack of a fossil record and difficult logistics of sample collection, few genome-wide multi-species studies exist in the great apes. Furthermore, the computational power needed to adequately compare such large datasets have only become readily available in recent decades. Simulations of a neutral expectation are needed so that observations can be more precisely calculated. While it has been possible to simulate simple scenarios, only recently possible to reasonably simulate an accurate demographic history, or to implement theoretical selective scenarios in which to compare to real data. The increase in computational power has also led to the development of machine learning approaches. A machine learning approach allows for the combination of any number of characteristics in order to draw a single conclusion. This doctoral dissertation takes advantage of a newly available genetic data consisting of genome-wide sequences of chimpanzees and bonobos. It takes advantage of available computational power to simulate differing scenarios of evolutionary history of the chimpanzee. It combines these two aspects using a machine learning approach in order to mine the genome. Without these recent advances in both genetic data collection and computational tools, this type of study would be impossible. ix x List of Publications Nye J, Laayouni H, Kuhlwilm M, Mondal M, Marques-Bonet T, Bertranpetit J. Selection in the Introgressed Regions of the Chimpanzee Genome. Genome Biology and Evolution. 2018. 10(4):1132-1138 Nye J, Mondal M, Bertranpetit J, Laayouni H. A Fully Integrated Scan of Selection in the Chimpanzee Genome. Journal of Evolutionary Biology. Submitted. Villegas P, Nye J, Bertranpetit J, Laayouni H. Screening for positive selection signals on the X chromosome in human populations. Genome Biology and Evolution. Submitted. xi xii Table of Contents Acknowledgements iii Abstract iv Resum vii Preface ix List of Publications xi 1. Introduction 1 1.1 A Brief History of Population Genetics 3 1.2 Tests of Selection 6 a. Site Frequency Spectrum-based tests 6 b. Linkage Disequilibrium-based tests 8 c. Population Differentiation-based tests 9 1.3 Demography 10 a. Population History 10 b. Simulating Demography 12 1.4 Machine Learning 13 1.5 Chimpanzees 15 2. Fully Integrated Scan of Selection in the 25 Chimpanzee Genome 2.1 Introduction 27 2.2 New Approaches 28 2.3 Materials and Methods 29 a. Genome Sequences 29 b. Demographic Model 30 c. Simulations 30 d. Statistical Tests 31 e. Random Forest Algorithm 32 f. Genes Under Selection 33 2.4 Results 33 a. Regions identified using a Random 33 Forest approach b. Genes commonly under selection 36 in more than two subspecies xiii c. Genes commonly under selection 38 in two subspecies d. Genes uniquely under selection 40 in a specific subspecies 2.5 Discussion 44 2.6 Tables 47 2.7 Supplementary Information 50 a. Phasing 50 b. Demography 50 c. Random Forest algorithm 57 2.8 Supplementary Figures 59 2.9 Supplementary Tables 106 3. Selection in the Introgressed Regions 153 of the Chimpanzee Genome 3.1 Introduction 155 3.2 Materials and Methods 156 a. Genome sequence with signatures 156 of introgression b. Selection tests 156 c. Comparison with whole genome 157 d. Tissue enrichment 157 3.3 Results 159 a. Comparing introgressed regions to 159 the whole genome b. Selection scan of introgressed regions 161 c. Adaptive selection 164 d. Adaptive introgression 169 e. Long term balancing selection 170 3.4 Discussion 171 3.5 Tables 175 4. Discussion 203 4.1 Discussion 205 a. General Dissertation Review 205 b. Impacts of Demography 206 c. Targets of Selection 208 xiv d. Further Considerations 210 4.2 Community Resource 211 5. References 215 xv xvi 1. Introduction 1 2 1.1 A Brief History of Population Genetics When Charles Darwin published his theory of Evolution by Natural Selection in 1859, he posited that if a variable trait were heritable and some version of the trait donates an advantage, then the beneficial copy will increase in frequency over time. When it was published, no one was aware of the source of a trait’s variability. After the rediscovery of Mendel’s 1866 work on heritability in pea plants, research into discrete heritable units began at the turn of the 20th century. The field of genetics and genomics has advanced greatly in the past hundred years, to the point where we understand much of the coding variation in a genome. Variation in a genome is understood to be from three major sources. Single nucleotide polymorphisms (SNPs) are the changes to the genetic code due to mutations. Next, small insertions and deletions or microsatellites may donate small neighborhood changes. Lastly, larger sources of variability can come from chromosomal inversions and copy number variation which can have larger impacts on entire regions of a chromosome. There have been many studies into untangling the biological impact of all these types of variation within coding regions; however, a large gap in our knowledge of genetics is into the impact of variation in noncoding regions. In modern population genetics, more variables have been added to Darwin’s theory. The Hardy-Weinberg equation was published separately in 1908, and demonstrates that allele frequencies will remain at a constant equilibrium over time in the absence of evolutionary forces. While this equation still is much used to this day, it contains several unrealistic assumptions, such as random mating or infinite population size. 3 The next major iteration in population genetic theory came when Fisher in 1930 and Wright in 1931 published models that included finite population sizes, allowing the evolutionary force of drift to be defined. According to what is now known as the Wright-Fisher model, the size of the population is directly proportional to the chance that an allele will become fixed or disappear from the population, purely due to sampling. Later, in 1955 and 1956, Kimura added the two-allele case to the Wright-Fisher model and posited, in 1983, that almost all variation in the genome is neutral; producing what is now known as the Neutral Theory of Evolution.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages196 Page
-
File Size-