Population Genomics in Drosophila Melanogaster: a Bioinformatics Approach
Total Page:16
File Type:pdf, Size:1020Kb
ADVERTIMENT. Lʼaccés als continguts dʼaquesta tesi queda condicionat a lʼacceptació de les condicions dʼús establertes per la següent llicència Creative Commons: http://cat.creativecommons.org/?page_id=184 ADVERTENCIA. El acceso a los contenidos de esta tesis queda condicionado a la aceptación de las condiciones de uso establecidas por la siguiente licencia Creative Commons: http://es.creativecommons.org/blog/licencias/ WARNING. The access to the contents of this doctoral thesis it is limited to the acceptance of the use conditions set by the following Creative Commons license: https://creativecommons.org/licenses/?lang=en DOCTORAL THESIS Population genomics in Drosophila melanogaster: a bioinformatics approach Author Sergi Herv´asFern´andez Supervisors Dr. Antonio Barbadilla Prados Dr. S`onia Casillas Viladerrams Departament de Gen`eticai de Microbiologia Facultat de Bioci`encies Universitat Aut`onomade Barcelona 2018 Population genomics in Drosophila melanogaster: a bioinformatics approach Gen´omicade poblaciones en Drosophila melanogaster: una aproximaci´onbioinform´atica Gen`omicade poblacions a Drosophila melanogaster: una aproximaci´obioinform`atica Mem`oriapresentada per Sergi Herv´asFern´andez per a optar al grau de Doctor en Gen`eticaper la Universitat Aut`onomade Barcelona Sergi Herv´as Dr. Antonio Dra. S`oniaCasillas Fern´andez Barbadilla Prados Viladerrams Autor Director i Tutor Directora Bellaterra, 19 de Setembre de 2018 Abstract High-throughput sequencing technologies are allowing the description of genome-wide variation patterns for an ever-growing number of organ- isms. However, we still lack a thorough comprehension of the relative amount of different types of genetic variation, their phenotypic effects, and the detection and quantification of distinct selection regimes act- ing on genomes. The recent compilation of more than one thousand of worldwide wild-derived Drosophila melanogaster genome sequences reassembled using a standardized pipeline (Drosophila Genome Nexus, DGN, Lack et al. 2015, 2016) provides a unique resource to test molec- ular population genetics hypotheses, and ultimately understand the evolutionary dynamics of genetic variation in the populations. Besides, the increasing amount of genomic data available requires the continuous development and optimization of bioinformatics tools able to handle and analyze such information. Thus, the development and implementation of new biologically-oriented software addressing several steps from data acquisition, filtering, processing, display or analysis to the final reporting step is a constantly growing need, especially in fields dealing with large data sets, such as population genomics. This thesis is conceived as a comprehensive bioinformatics and popula- tion genomics project. It is centered in the development and application of bioinformatics tools for the analysis and visualization of nucleotide variation patterns and the detection of selective events in the genome of D. melanogaster, using the DGN data. The main goal is accomplished in three sequential steps: (i) capture the evolutionary properties of the analyzed sequences (i.e., create a catalog of population genetics metrics) and implement a tool for the graphical display of such information; (ii) develop a statistical package for the computation of the diverse selection regimes acting on genomes (positive and purifying selection), and finally (iii) perform an initial population genomics analysis in D. melanogaster using the previously developed tools. The common ap- proach applied to process the data, starting at the assembly of genome sequences and ending up at the estimates of population genetics metrics, allows performing, for the first time, a comprehensive comparison and interpretation of results using samples from five continents. Overall, this work provides a global overview of the nucleotide variation and adaptation patterns along the genome, and a general assessment of the relative impact of the major genomic determinants of genetic variation, in Drosophila meta-populations with different geographical origin. i Table of Contents 1 Introduction 1 1.1 Biological evolution and molecular population genetics 3 1.1.1 Mutation as the ultimate source of genetic variation . 5 1.1.2 The (nearly) neutral theory of molecular evolution . 8 1.1.3 Population dynamics of new mutations and the distribution of fitness effects . 11 1.2 Detecting the genomic footprint of natural selection 14 1.2.1 Determinants of patterns of genetic variation 14 1.2.2 Genome-wide signatures of selection . 19 1.2.3 Tests of selection . 20 1.3 Drosophila as a model organism for population genetics 29 1.3.1 Population genomics projects in Drosophila melanogaster . 30 1.3.2 The Drosophila Genome Nexus sequence data 32 1.4 Bioinformatics of genetics diversity . 34 1.4.1 Genome browsers: graphical biological databases 36 1.4.2 The R project, an standard for statistical genetics analyses . 44 1.5 Objectives . 47 2 Materials and Methods 49 2.1 Data description . 51 2.1.1 D. melanogaster genome sequences . 51 2.1.2 Geographic units of analysis . 51 2.1.3 Functional annotations and outgroup species 53 2.2 Estimation of population genetics parameters . 56 2.2.1 Integration of annotations and sequence genomic data . 56 2.2.2 Windows-based estimates . 58 2.2.3 Genes-based estimates . 65 2.2.4 Statistical analyses . 68 2.3 PopFly: the Drosophila population genomics browser 69 2.3.1 Browser software and interface . 69 2.3.2 Development and implementation of new utilities 70 iii 2.4 iMKT: an R package for the Integrative McDonald and Kreitman Test . 74 2.4.1 Implementation of four McDonald and Kreitman derived tests . 74 2.4.2 Input custom data . 79 2.4.3 Retrieval and analysis of PopFly and PopHuman data . 80 3 Results 81 3.1 PopFly: the Drosophila population genomics browser 83 3.1.1 Browser interface . 86 3.1.2 Utilities and support resources . 89 3.1.3 Testing evolutionary hypotheses using PopFly 94 3.2 iMKT: an R package for the Integrative McDonald and Kreitman Test . 99 3.2.1 Estimators of the integrative MKT . 101 3.2.2 Calculation of MKT-derived methods . 103 3.2.3 iMKT using PopFly and PopHuman data . 108 3.3 Population genomics analyses in Drosophila melanogaster . 115 3.3.1 Genome-wide polymorphism and divergence patterns . 117 3.3.2 The landscape of population historical recombination . 127 3.3.3 Detection of positive and purifying selection in the Drosophila genome . 131 3.3.4 The effect of recombination and coding density in the rates of nucleotide variation and adaptation . 141 4 Discussion 149 4.1 Bioinformatics tools for population genomics . 151 4.1.1 PopFly: the Drosophila population genomics browser . 152 4.1.2 Integrative MKT software . 158 4.2 Population genomics of Drosophila melanogaster . 167 4.2.1 Impact of demography on populations structure168 4.2.2 Genome-wide nucleotide variation and recombination patterns . 170 iv 4.2.3 Prevalence of purifying selection and evidence of adaptive evolution . 176 4.2.4 Genomic determinants of the adaptation rate and the HRi . 180 4.2.5 The faster-X hypothesis . 184 5 Conclusions 187 6 References 193 7 Annexes 211 7.1 Supplementary Tables . 213 7.2 Supplementary Figures . 217 7.3 Article 1. PopFly: the Drosophila population genomics browser . 221 7.4 Article 2. PopHuman: the human population genomics browser . 225 List of Tables 235 List of Figures 237 List of Boxes 239 List of Supplementary Tables 239 List of Supplementary Figures 239 v 1. Introduction 1 1. Introduction 1.1 Biological evolution and molecular population genetics The publication of Charles Darwin's work The Origin of Species (1859) supposed a revolutionary shifting in the way we understand and explain biological diversity. In his book, Darwin first introduced the concept of biological evolution, conceived as a population process where variation among individuals within a population is converted, through its magnification in time and space, in new populations, new species and, by extension, all biological diversity present, or that ever existed, on Earth (Lewontin 1974). Therefore, biological evolution is just the result of the process of change in populations through generations. There are two requirements for evolution to occur: (i) there must be variation in the phenotypic traits between individuals within a population and, (ii) this variation must be inheritable (at least partially) among generations (Lewontin 1970, Endler 1986). In other words, evolution can be defined as a process of statistical transformation of groups of interbreeding individuals that share a common genetic pool which defines their phenotypes (i.e., Mendelian populations, the units of evolution). Thus, populations are defined by the distribution of alleles and genotypes from all their individuals. Phenotypic traits are determined by a combination of the underlying genotypes and environmental pressures. DNA is the molecule that carries the genetic (genotypic) information (Avery et al. 1944). One of its most important properties is that it is intrinsically mutable, meaning that it has the potential to originate new genetic variants which can be accurately replicated and transmitted from generation to generation, providing the substrate on which evolution can occur. These new genetic variants can contribute differentially to the sur- vival or reproductive success of individuals within the populations (fitness differences). If they do so, then natural selection takes place. 3 1. INTRODUCTION Natural selection can be defined as a key process shaping the