Large-Scale Evolutionary Analysis of Polymorphic Inversions in the Human Genome
WARNING. Access to the contents of this doctoral thesis is conditional on acceptance of the conditions of use set by the following Creative Commons license: https://creativecommons.org/licenses/?lang=en (Catalan: http://cat.creativecommons.org/?page_id=184; Spanish: http://es.creativecommons.org/blog/licencias/)

Doctoral thesis

Large-scale evolutionary analysis of polymorphic inversions in the human genome

Author: Carla Giner Delgado
Director: Mario Cáceres Aguilar

Departament de Genètica i de Microbiologia
Facultat de Biociències
Universitat Autònoma de Barcelona
2017

Dissertation presented by Carla Giner Delgado to qualify for the degree of Doctor in Genetics from the Universitat Autònoma de Barcelona.

Bellaterra, 28 September 2017

Abstract

Chromosomal inversions are structural variants that invert a fragment of the genome, usually without modifying its content, and their subtle but powerful effects in natural populations have long fascinated evolutionary biologists. They were discovered a century ago in fruit flies, and their association with different evolutionary processes, such as local adaptation and speciation, soon became evident in several species. However, in the current era of genomics and big data, inversions frequently escape the grasp of current technologies and remain largely overlooked in humans. During the last few years, the InvFEST Project has aimed to address the missing knowledge about human inversions by validating and genotyping a large fraction of predicted polymorphisms.
In particular, it has generated one of the most useful data sets on human inversions, consisting of 45 common inversions (with sizes from 83 bp to 415 kbp) genotyped at high quality in 550 individuals of seven populations of diverse ancestry. This thesis takes advantage of the available population-scale information, combined with whole-genome sequences available from the 1000 Genomes Project, to carry out the first detailed analysis of the evolutionary properties of human polymorphic inversions. The methods used combine theoretical models, simulations and empirical comparisons with other mutation types. Besides the complete characterization of the data set, the results confirm fundamental differences between inversions created by different mechanisms. The frequency distribution of the 21 inversions generated by non-homologous mechanisms (NH) is similar to that expected for neutral variants when controlling for detection biases, which indicates that they are not subject to strong negative selection. Recombination is completely inhibited across the whole inversion length, with no clear genetic exchange found, and possibly over a few kbp beyond the breakpoints. As a result, NH inversions strongly affect local genome variation levels, as predicted by computer simulations, with older inversions increasing total nucleotide diversity, while younger ones at very high frequency could have the opposite effect. In contrast, most inversions created by non-allelic homologous recombination (NAHR) (19/24) have appeared independently in different haplotypes in the sample. These high recurrence levels are reflected in several measures: they are enriched in intermediate frequencies, share multiple nucleotide polymorphisms between orientations, and have little linkage disequilibrium with neighbouring variants, which limits their detection by tag SNP strategies.
Finally, in order to find inversions that are functional candidates, different signatures of selection on inversions were explored based on their frequencies, population differentiation and sequence variation patterns. Ten candidates were revealed, with three of them found to be >1.5 million years old and maintained at intermediate frequencies, possibly by balancing selection. One of these was also found in archaic hominins. Other candidates seem to have reached high frequencies in a short period of time in some populations, consistent with positive selection. Notably, over half of the candidates are located within gene regions, which suggests that they may have functional effects. Thus, this work offers an overview of inversion dynamics and their role as genomic modifiers, opening interesting avenues of investigation.

Contents

1 Introduction
1.1 Human genome hidden complexity
1.1.1 Types of genomic variation
1.1.2 Detection of structural variants
1.1.3 Mutational mechanisms
1.2 Inversions: a special mutation type
1.2.1 Inhibition of recombination
1.2.2 Effect on nucleotide diversity patterns
1.2.3 Evolutionary importance
1.3 Inference of evolutionary history
1.3.1 Human evolutionary history
1.3.2 Detecting selection in humans
1.3.3 Neutrality tests applied to human inversions
1.4 The InvFEST Project
1.5 Objectives
2 Materials and methods
2.1 Data set description
2.1.1 Origin of the studied inversions
2.1.2 Available inversion annotations
2.1.3 Genotyping panel
2.1.4 Experimental methods overview
2.2 Characterization of the inversion data set
2.2.1 Improvement of inversion annotation
2.2.2 Simulation of inversion frequency ascertainment bias
2.2.3 Frequency estimates and population distribution
2.2.4 Comparison to 1000GP inversion data
2.2.5 Tag variant analysis
2.3 Inversion frequency patterns
2.4 Effect on local nucleotide diversity
2.4.1 Inversion simulation
2.4.2 Inversion nucleotide diversity
2.5 Linked variation and recombination
2.5.1 Generation of same-frequency SNP data set
2.5.2 Linkage disequilibrium patterns
2.5.3 Linked variant classification
2.5.4 Definition of non-recombining regions
2.6 Evolutionary history inference
2.6.1 Age estimate from divergence
2.6.2 Haplotype analysis
3 Results
3.1 Inversion data set characterization
3.1.1 Inversion annotation
3.1.2 Inversion frequencies
3.1.3 Comparison with 1000GP inversion data set
3.1.4 Inversion tag variants
3.2 Analysis of inversion frequencies
3.2.1 Inversion allele frequency spectrum
3.2.2 Population differentiation
3.3 Effect on local nucleotide diversity
3.3.1 Inversion simulation
3.3.2 Observed values
3.3.3 Neutrality tests
3.4 Linked variation and recombination
3.4.1 Linkage disequilibrium patterns
3.4.2 Types of linked variants
3.4.3 Recombination outside the breakpoints
3.5 Inversion history and dynamics
3.5.1 Inversion age
3.5.2 History reconstruction from haplotypes
3.6 Inversions under selection
3.6.1 Signatures of balancing selection
3.6.2 Signatures of positive selection
4 Discussion
4.1 Human polymorphic inversions
4.1.1 Inversion frequency
4.1.2 Inversion size
4.1.3 Mechanisms of formation
4.2 Inversion effect on fertility
4.3 Recombination inhibition
4.3.1 Limitations of low-coverage sequencing data
4.3.2 Genetic flux measure
4.3.3 Recombination past the breakpoints
4.3.4 Additional resources
4.4 Recurrence
4.4.1 Recurrence determinants
4.5 Functional effects and selection
4.5.1 Functional candidates
4.5.2 Signatures of selection
4.5.3 Neutrality tests for inversions
5 Conclusions
Bibliography
A Supplementary figures
B Supplementary tables

List of Figures

1.1 Types of genomic variation
1.2 Paired-end mapping signatures
1.3 Models of inhibition of recombination
1.4 Human evolutionary history
1.5 Overview of InvFEST study to genotype and characterize common inversion polymorphisms in humans
2.1 Characteristics of the nine fosmid libraries used in the PEM analysis
2.2 Location of the 45 genotyped inversions in the human genome
2.3 Size and breakpoint complexity of the 45 genotyped inversions
2.4 Genotyped individuals also in 1000GP
2.5 Experimental genotyping methods
2.6 Size and breakpoint characteristics of predicted inversions
2.7 Probability of inversion detection by paired-end mapping
2.8 Performance of age estimator in simulations
2.9 Local substitution rate estimates
3.1 Unclear dotplot alignments
3.2 Inversion orientation in different primates
3.3 Overview of the detection and selection of analysed inversions
3.4 Estimated proportion of variants missed in each step
3.5 Inversion frequency overview
3.6 Inverted sequence per individual compared to HG18
3.7 Cumulative length of inversions in heterozygosis in the genotyped samples
3.8 Genotype and frequency comparison with 1000GP inversions
3.9 Inversions with tag variants
3.10 Tag variants at population and super-population level
3.11 Commercial array coverage of inversion tag SNPs
3.12 Observed and expected frequency distribution of inversions
3.13 Inversion population differentiation levels
3.14 Effect of inversions on nucleotide diversity levels
3.15 Inversion effect on Tajima's D
3.16 Maximum r2 with nearby variants
3.17 Linkage disequilibrium patterns
3.18 Physical position of different classes of variants linked to inversions
3.19 Changes in variant type proportion close to inversions
3.20 Error in age estimate
3.21 Inversion age estimates
3.22 Inversion haplotype examples
3.23 Recurrence in inversion HsInv0832