Bioinformatic Characterization and Analysis of Polymorphic Inversions in the Human Genome
Total Page:16
File Type:pdf, Size:1020Kb
Bioinformatic characterization and analysis of polymorphic inversions in the human genome Author: Alexander Martínez Fundichely DOCTORAL THESIS UPF = YEAR 2013 Supervisor: Dr. Mario Cáceres Aguilar Tutor: Dr. Xavier Estivill Pallejà Department of Experimental and Health Sciences A thesis submitted for the degree of PhilosophiæDoctor (PhD) in Biomedicine. In the department of Experimental and Health Sciences of the Universitat Pompeu Fabra (UPF). By the doctoral candidate Msc. Alexander Martínez Fundichely Supervisor: Principal investigator of the Institut de Biotecnologia i de Biomedicina (IBB), Universitat Autònoma de Barcelona (UAB), Bellaterra, Barcelona, Spain. And member of the Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain. Dr. Mario Cáceres Aguilar Tutor: Senior investigator at the Center for Genomic Regulation (CRG), Barcelona, Catalonia in Spain. And associate professor of Pompeu Fabra University (UPF) Dr. Xavier Estivill Pallejà Day of the defense: 12=12=2013 To my Mom To my maternal grandma Especially to my missing grandparents and From the bottom of my heart, I wish share this achievement with my Dad (without you physically present, everything is harder than that we talked before, but step by step I’m keeping on the play). ACKNOWLEDGEMENTS First and foremost I offer my sincerest and utmost gratitude to my senior su- pervisor, Dr. Mario Cáceres, who has supported me throughout my graduate studies with his encouragement, guidance and resourcefulness. In addition to his experience in genomics that was a crucial factor in the success of this thesis, I have learned the requirements of being a good researcher from him. I thank him one more time for all his help. I would like to give my regards and appreciations to Dr. Xavier Estivill, this whole story began with his acceptance to my visit to his lab. Starting from the first beginning, I would also like to thank my other collaborators in the CRG but especially thanks to the people in the lab of IBB, a real nice people who offered me friendship and collaborations and all of them without exception have been important in the development of this thesis. Their motivation was a huge driving force for me. Especially thanks to my new family for the love, affection and support they gave me in all these years, that helps me to overcome the sadness of being far away from my loved ones and feel like one more catalan. I would also like to thank all my friends every one are mentioned here al- though not seen the name list,(Cuban, Catalan, Spanish...) their help has been very important in this achievement. Last, but not least, I thank to my mom and grandma, their support even at the distance has been essential to reach this goal, and thank to all the other Cuban family who are interested in me, and always helps my mother when she needs. Alexander Martinez Fundichely v ABSTRACT Within the great interest in the characterization of genomic structural variants (SVs) in the human genome, inversions present unique challenges and have been little studied. This thesis has developed GRIAL, a new algorithm focused specif- ically in detect and map accurately inversions from paired-end mapping (PEM) data, which is the most widely used method to detect SVs. GRIAL is based on geometrical rules to cluster, merge and refine both breakpoints of putative inver- sions. That way, we have been able to predict hundreds of inversions in the human genome. In addition, thanks to the different GRIAL quality scores, we have been able to identify spurious PEM-patterns and their causes, and discard a big fraction of the predicted inversions as false positives. Furthermore, we have created In- vFEST, the first database of human polymorphic inversions, which represents the most reliable catalogue of inversions and integrates all the associated information from multiple sources. Currently, InvFEST combines information from 34 dif- ferent studies and contains 1092 candidate inversions, which are categorized based on internal scores and manual curation. Finally, the analysis of all the data gener- ated has provided information on the genomic patterns of inversions, contributing decisively to the understanding of the map of human polymorphic inversions. vii RESUMEN Dentro del estudio de las variantes estructurales en el genoma humano, las inversiones presentan retos específicos y han sido poco caracterizadas. Esta tesis aborda este problema a través de la implementación de GRIAL un nuevo algoritmo específicamente diseñado para detectar y localizar de forma precisa las inversiones a partir de datos de mapeo de secuencias apareadas PEM1, que es el método más uti- lizado para estudiar la variación estructural. GRIAL se basa en reglas geométricas para agrupar los patrones de PEM correspondientes a los posibles puntos de rotura y refinar su localización para cada inversión. Los resultados de GRIAL nos per- mitieron predecir cientos de inversiones en el genoma humano. Además, gracias a la creación de índices de fiabilidad para las predicciones, se ha podido identificar patrones de inversión incorrectos y sus causas, descartando un gran número de predicciones posiblemente falsas. Por otra parte, se ha creado InvFEST, la primera base de datos dedicada a inversiones polimórficas en el genoma humano, la cual representa el catálogo más fiable de inversiones e integra toda la información aso- ciada disponible de múltiples fuentes. Actualmente, InvFEST combina informa- ción de 34 estudios diferentes e incluye 1092 inversiones clasificadas según crite- rios internos y anotación manual. Por último el análisis de toda la información generada, nos ha permitido describir los patrones genómicos de las inversiones contribuyendo decisivamente a descifrar el mapa de las inversiones polimórficas humanas. 1del inglés paired-end mapping (PEM) ix PREFACE A little more than a decade pass from the completion of the Human Genome Project, the efficiency of DNA sequencing has drastically improved. The high throughput next generation sequencing (NGS) technologies have been reducing the cost and are increas- ing the capacity for sequence production of thousand of individuals at an unprecedented rate on within different international project [1000 Genomes Project et al., 2012; Inter- national Cancer Genome et al., 2010]. These newest sequencing technologies adding to traditional Sanger sequencing have significantly changed how genomic research is con- ducted, and have provided new insights on the genetic basis of phenotypic and disease- susceptibility differences between individuals through the uncovered an unprecedented degree of structural variation in the human genome. Although the Database of Genomic Variants (DGV) is showing the great advances on identifying of several types of variations among individual genomes. The genomic inversions have been relatively disregarded compared to copy number variations (CNVs) due to their difficulty of study that makes the analysis of these variants on the sequenced genomes very challenging mainly due to the complex, and repetitive nature of human genomes that in particular become much harder when dealing with balanced rearrange- ments. This thesis starts in chapter 1 with a general introduction on the study of structural variation with particular emphasizing in the inversions. This chapter exposed the state- ment of the scientific problem addressed and the main objective of the thesis project. The results have three integral parts that have a strong connection to each other from a methodological point of view, each part is addressing different levels of study of the xi PREFACE inversions and are represented in the three manuscript of papers in chapters of results (2, 3, 4). The first result in chapters 2 is focused on genome sequence analysis, in particular development of computational methods for structural variation discovery in sequenced genomes. We present our effort in developing novel paired-end mapping algorithm for identifying specifically inversions, adding also two scores to assess the reliability of the predictions, as well as a new dataset of inversion predictions discovered in multiple sample genomes. In this part, we also address some problems of paired-end sequencing experiment, based on fosmid clones. The second result of this thesis in chapters 3, is focused on develop a data base for a comprehensive integration of all information about the inversions in the human genome. In this case we create an alternative data source centered on human inversions studies, to store the predicted inversions and their accu- rate break points, validation status or population distribution among other data. The last third result of this thesis in the chapters 4 is, however, focused on the descriptive analysis of genomic patterns of the current information about human inversion poly- morphisms. We discuss several features of the most reliable catalogue of inversion that represent a preliminary characterization of this variant in the human genome that is paramount to better understanding of the possible biases and trends in the detection of inversions in human genome. The chapter 5 is a general discussion that presents in a integrated global result the three topic developed in the thesis project and their contributions. The thesis concludes with the chapter 6 that are a general conclusion of the thesis. The different results of this work have been presented at the RECOMB/ISCB 2012 and ISMB/ECCB 2012 and 2013 conferences. The chapter 3 is already published in Nucleic Acids Research (NAR) 2013. Finally in the appendix part are also presented two scientific articles in which the developing of this thesis has collaborated. xii CONTENTS Abstract vii Resumen ix Preface xi List of Tables xvii List of Figures xix Glossary xxi 1 General introduction 1 1.1 The study of the human genome . 1 1.2 Genomic structural variation . 4 1.3 Copy number variation (CNV) . 7 1.4 Genomic inversions . 9 1.4.1 Origin of genomic inversions . 11 1.4.1.1 Homologous recombination mechanisms . 12 1.4.1.2 Non-homologous and microhomology mechanisms . 15 1.4.2 Inversion effects . 18 1.4.2.1 Mutational effect .