Development and Application of Bioinformatic Tools for the Representation and Analysis of Genetic Diversity
Total Page:16
File Type:pdf, Size:1020Kb
Development and application of bioinformatic tools for the representation and analysis of genetic diversity THE COVER The cover illustrates artistically the kind of data and methods used in this thesis, as well as displays screen shots of the PDA and the DPDB websites that have been created. Kindly designed by Miquel Ràmia. Development and application of bioinformatic tools for the representation and analysis of genetic diversity Desenvolupament i aplicació d’eines bioinformàtiques per a la representació i anàlisi de la diversitat genètica Desarrollo y aplicación de herramientas bioinformáticas para la representación y análisis de la diversidad genética DOCTORAL THESIS Sònia Casillas Viladerrams Universitat Autònoma de Barcelona · Facultat de Biociències · Departament de Genètica i de Microbiologia Bellaterra, 2007 Memòria presentada per la Llicenciada en Biologia Sònia Casillas Viladerrams per a optar al grau de Doctora en Ciències Biològiques. Bellaterra, a 26 d’octubre del 2007 El Doctor Antonio Barbadilla Prados, Professor titular del Departament de Genètica i de Microbiologia de la Facultat de Biociències de la Universitat Autònoma de Barcelona, i el Doctor Alfredo Ruiz Panadero, Catedràtic del mateix departament, CERTIFIQUEN: que la Sònia Casillas Viladerrams ha dut a terme sota la seva direcció el treball de recerca realitzat en el Departament de Genètica i de Microbiologia de la Facultat de Biociències de la Universitat Autònoma de Barcelona que ha portat a l’elaboració d’aquesta Tesi Doctoral, titulada “Desenvolupament i aplicació d’eines bioinformàtiques per a la representació i anàlisi de la diversitat genètica”. I perquè consti als efectes oportuns, signen el present certificat a Bellaterra, a 26 d’octubre del 2007. Dr. Antonio Barbadilla Prados Dr. Alfredo Ruiz Panadero ⎯ Al Jordi, pel teu amor i suport incondicional ⎯ Als pares, per haver-me guiat i aconsellat sempre ⎯ A la iaia, pel teu inesgotable optimisme ⎯ A la Georgina, la Berta i tota la resta de família i amics, per les estones tant agradables compartides i perquè el temps que he dedicat a aquesta tesi no l’he pogut dedicar a vosaltres Table of Contents PREFACE | PRÒLEG | PRÓLOGO xiv PART 1 | INTRODUCTION 1 1.1. The variational paradigm of biological evolution 3 1.2. Genetic diversity 4 1.2.1. The genetic material as the carrier of the evolutionary process 4 1.2.2. The dynamics of genetic variation 5 1.2.3. One century of population genetics 6 1.2.4. Estimating genetic diversity from nucleotide sequences 11 1.2.5. The population size paradox: from genetic drift to genetic draft 14 1.2.6. Testing the neutral theory with polymorphism and divergence data 17 1.3. Bioinformatics of genetic diversity 24 1.3.1. Management and integration of massive heterogeneous data: the need for resources 24 1.3.2. Databases of nucleotide variation 26 1.4. Evolution of noncoding DNA 28 1.4.1. The amount of noncoding DNA and organismal complexity 28 1.4.2. Structure of the eukaryotic gene 30 1.4.3. Conserved noncoding sequences and the fraction of functionally important DNA in the genome 33 1.4.4. The role of ncDNA in maintaining morphological diversity 36 1.5. Hox genes and their role in the evolution of morphology 37 1.5.1. Origin of the body plan in animals 37 1.5.2. Homeotic genes and the discovery of the homeobox 38 1.5.3. The Hox gene complex 40 1.5.4. New functions for some insect Hox genes 40 x Table of Contents 1.6. Drosophila as a model organism 45 1.7. Objectives 47 1.7.1. Development of tools for data mining, processing, filtering and quality checking of raw data 47 1.7.2. Generation of databases of knowledge 49 1.7.3. Testing of hypotheses 49 PART 2 | RESULTS 51 2.1. Pipeline Diversity Analysis (PDA): a bioinformatic system to explore and estimate polymorphism in large DNA databases 53 Article 1: CASILLAS, S. and A. BARBADILLA (2004) PDA: a pipeline to explore and estimate polymorphism in large DNA databases. Nucleic Acids Research 32: W166-W169 54 Article 2: CASILLAS, S. and A. BARBADILLA (2006) PDA v.2: improving the exploration and estimation of nucleotide polymorphism in large datasets of heterogeneous DNA. Nucleic Acids Research 34: W632-W634 58 2.2. Drosophila Polymorphism Database (DPDB): a secondary database of nucleotide diversity in the genus Drosophila 61 Article 3: CASILLAS, S., N. PETIT and A. BARBADILLA (2005) DPDB: a database for the storage, representation and analysis of polymorphism in the Drosophila genus. Bioinformatics 21: ii26-ii30 62 Article 4: CASILLAS, S., R. EGEA, N. PETIT, C. M. BERGMAN and A. BARBADILLA (2007) Drosophila Polymorphism Database (DPDB): a portal for nucleotide polymorphism in Drosophila. Fly 1: 205-211 67 Supplementary Material 74 2.3. Using patterns of sequence evolution to infer functional constraint in Drosophila noncoding DNA 77 Article 5: CASILLAS, S., A. BARBADILLA and C. M. BERGMAN (2007) Purifying selection maintains highly conserved noncoding sequences in Drosophila. Molecular Biology and Evolution 24: 2222-2234 78 Supplementary Material 91 Table of Contents xi 2.4. Coding evolution of Hox and Hox-derived genes in the genus Drosophila 105 Article 6: CASILLAS, S., B. NEGRE, A. BARBADILLA and A. RUIZ (2006) Fast sequence evolution of Hox and Hox-derived genes in the genus Drosophila. BMC Evolutionary Biology 6: 106 106 Supplementary Material 121 PART 3 | DISCUSSION 133 3.1. Bioinformatics of genetic diversity 135 3.1.1. The challenge: automating the estimation of genetic diversity 136 3.1.2. Data model 137 3.1.3. Data gathering and processing 137 3.1.4. Confidence assessment of each polymorphic set 143 3.1.5. Testing the quality of the PDA output 143 3.1.6. DPDB contents and quality of the data 144 3.1.7. Using PDA and DPDB for large-scale analyses of genetic diversity 145 3.1.8. What is next? 148 3.2. Using patterns of sequence evolution to infer constraint and adaptation in Drosophila noncoding DNA 149 3.2.1. Drosophila noncoding DNA 149 3.2.2. Detection of conserved noncoding sequences by comparative genomics 150 3.2.3. Evolutionary forces governing patterns of noncoding DNA in Drosophila 151 3.2.4. An integrative model of cis-regulatory evolution for Drosophila 156 3.2.5. Will the same model apply to other species? 159 3.3. Coding evolution of Hox genes: fast divergence or a paradigm of functional conservation? 160 3.3.1. Predicting the rate of evolution of Hox genes 161 3.3.2. Estimating the rate of evolution of Hox genes: nucleotides versus indels 162 xii Table of Contents 3.3.3. The impact of homopeptides and other short repetitive motifs in the coding evolution of Hox proteins 163 3.3.4. An integrative view of the co-evolution between transcription factor coding regions and cis-regulatory DNA 165 PART 4 | CONCLUSIONS | CONCLUSIONS | CONCLUSIONES 169 APPENDIX I 181 EGEA, R., S. CASILLAS, E. FERNÁNDEZ, M. A. SENAR and A. BARBADILLA (2007) MamPol: a database of nucleotide polymorphism in the Mammalia class. Nucleic Acids Research 35: D624- D629 APPENDIX II 189 PETIT, N., S. CASILLAS, A. RUIZ and A. BARBADILLA (2007) Protein polymorphism is negatively correlated with conservation of intronic sequences and complexity of expression patterns in Drosophila melanogaster. Journal of Molecular Evolution 64: 511-518 APPENDIX III 199 NEGRE, B., S. CASILLAS, M. SUZANNE, E. SÁNCHEZ-HERRERO, M. AKAM, M. NEFEDOV, A. BARBADILLA, P. DE JONG and A. RUIZ (2005) Conservation of regulatory sequences and gene expression patterns in the disintegrating Drosophila Hox gene complex. Genome Research 15: 692-700 BIBLIOGRAPHY 211 WEB REFERENCES 227 ABBREVIATIONS 229 INDEX OF TABLES 233 INDEX OF FIGURES 234 INDEX OF BOXES 236 Table of Contents xiii Preface Genetic variation is the cornerstone of biological evolution. The description and explanation of the forces controlling genetic variation within and between populations is the main goal of population genetics (LEWONTIN 2002). The deciphering of an explosive number of nucleotide sequences in different genes and species has changed radically the scope of population genetics, transforming it from an empirically insufficient science into a powerfully explanatory interdisciplinary endeavor, where high-throughput instruments generating new sequence data are integrated with bioinformatic tools for data mining and management, and advanced theoretical and statistical models for data interpretation. This thesis is an integrative and comprehensive bioinformatics and population genetics project whose central topic is the genetic diversity of populations. It is accomplished in three sequential steps: (i) the development of tools for data mining, processing, filtering and quality checking of raw data, (ii) the generation of databases of knowledge from refined data obtained in the first step, and (iii) the testing of hypotheses that require multi-species and/or multi-locus data. In the first part of the thesis, we have developed PDA —Pipeline Diversity Analysis—, an open-source, web-based tool that allows the exploration of polymorphism in large datasets of heterogeneous DNA sequences. This tool feeds from the millions of haplotypic sequences from individual studies that are stored in the main molecular biology databases, and generates high- quality, population genetics data that can be used to describe patterns of nucleotide variation in any species or gene. All the extracted and analyzed data resulting from the first part of this thesis is used in the second step to create a comprehensive on-line resource that provides searchable collections of polymorphic sequences with their associated diversity measures in the genus Drosophila (DPDB —Drosophila Polymorphism Database—). This resource means an ambitious pledge to test the efficiency of the system created in the first part. Finally, two different studies that make use of the modules of data mining and analysis developed are shown.