Evolution of Gene Blocks in Bacteria Using an Event-Based Model
Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations
2020

Evolution of gene blocks in bacteria using an event-based model
Huy Nguyen
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Nguyen, Huy, "Evolution of gene blocks in bacteria using an event-based model" (2020). Graduate Theses and Dissertations. 18368. https://lib.dr.iastate.edu/etd/18368

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Evolution of gene blocks in bacteria using an event-based model
by
Huy Nguyen

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:
Oliver Eulenstein, Co-major Professor
Iddo Friedberg, Co-major Professor
David Fernandez-Baca
Dennis V. Lavrov
Xiaoqiu Huang

The student author, whose presentation of the scholarship herein was approved by the program of the study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2020

Copyright © Huy Nguyen, 2020. All rights reserved.

DEDICATION

To my parents, Danh Nguyen and Hoa Tran, who have encouraged, inspired, and enabled me to pursue my educational studies, and my family and friends for their support throughout my studies at home and abroad.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
ABSTRACT
CHAPTER 1. INTRODUCTION
CHAPTER 2. RELATED WORKS
    2.1 Gene Block Evolution Models
    2.2 Finding Orthologous Gene Blocks in Bacteria
    2.3 Ancestral Reconstruction of Gene Blocks in Bacteria
CHAPTER 3. PRELIMINARIES
    3.1 Event-based model
    3.2 Phylogenetic tree
CHAPTER 4. FINDING ORTHOLOGOUS GENE BLOCKS IN BACTERIA: THE COMPUTATIONAL HARDNESS OF THE PROBLEM AND NOVEL METHODS TO ADDRESS IT
    4.1 The Relevant-Gene-Block problem
    4.2 The RGB problem is Hard
    4.3 Solving the RGB problem
        4.3.1 Greedy Algorithm
        4.3.2 Integer Linear Programming
    4.4 Results and Discussion
        4.4.1 Scalability Study
        4.4.2 Accuracy Study
        4.4.3 Concluding Discussion
    4.5 Conclusion
CHAPTER 5. TRACING THE ANCESTRY OF OPERONS IN BACTERIA
    5.1 The Ancestral-Reconstruction-Gene-Block problem
    5.2 Solving the ARGB problems
        5.2.1 Local Algorithm
        5.2.2 Global Algorithm
    5.3 Results and Discussion
        5.3.1 Operons using Escherichia coli as a reference taxon
        5.3.2 Operons using Bacillus subtilis as a reference taxon
    5.4 Conclusions
CHAPTER 6. UNRAVELING A TANGLED SKEIN: EVOLUTIONARY ANALYSIS OF THE BACTERIAL GIBBERELLIN BIOSYNTHETIC OPERON
    6.1 Gene clusters analysis
    6.2 Results and Discussion
    6.3 Appendix: Supplementary Materials
CHAPTER 7. DISCUSSION AND FUTURE DIRECTIONS
    7.1 Finding Orthologous Gene Blocks in Bacteria: the Computational Hardness of the Problem and Novel Methods to Address It
        7.1.1 Discussions
        7.1.2 Future directions
    7.2 Tracing the Ancestry of Operons in Bacteria
        7.2.1 Discussions
        7.2.2 Future directions
    7.3 Unraveling a Tangled Skein: Evolutionary Analysis of the Bacterial Gibberellin Biosynthesis Operon
        7.3.1 Discussions
        7.3.2 Future directions
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Pairwise events for the orthoblocks shown A-E

Table 6.1 Pairwise alignments of GGPS2 proteins and representative GGPS proteins.
Protein sequences were aligned using the MUSCLE algorithm (Geneious Prime; default settings). The GGPS proteins included in this analysis were selected due to the relatively close phylogenetic relationship between the core operon of this strain and that of the ggps2-containing strains, which lack a full-length ggps gene (i.e. Rhizobium etli bv. mimosae str. Mim1 has the most similar core operon to the ggps2-containing Rhizobium strains and Bradyrhizobium sp. WSM1417 has the most similar core operon to ggps2-containing Bradyrhizobium strains). Sequences with greater than 90% amino acid identity are highlighted in dark blue, while those with greater than 80% but less than 90% amino acid identity are highlighted in light blue.

Table 6.2 List of strains used within the ancestral reconstruction analyses. Shown for each strain is the corresponding GenBank genome accession, the legume host from which it was isolated, and the nodule type (D = determinate, I = indeterminate, * = neither). Note that the legume host and nodule type are irrelevant for the gammaproteobacteria plant pathogens analyzed in these reconstructions.

Table 6.3 Strains containing a CYP114-FdGA gene fusion.

Table 6.4 Strains containing a FdGA-SDRGA gene fusion.

LIST OF FIGURES

Figure 2.1 Species C has an experimentally determined operon (black arrows) and serves as the reference taxon. The orthologs in species A, B, D, and E were determined as explained in the text. The events between C and all other species for this orthoblock are A-C: deletion (of gene c); B-C: split (of gene c); C-D: duplication (of b) and split (jagged line); C-E: duplication (of b).

Figure 2.2 Given a tree rooted at node x whose leaf nodes are labeled with nucleotides, the reconstruction problem finds a nucleotide assignment for each inner node. Figure 2.2b is an example of such an assignment.
Figure 4.1 The x-axis shows the operon/gene block names, sorted in ascending order by the number of genes in each block. From left to right: operon ast to operon ygb have 5 genes in the gene block, operon cai to operon ydh have 6 genes, operon cas to operon tdc have 7 genes, operon bam to operon ydg have 8 genes, operon atp to operon yje have 9 genes, operon paa to operon yrb have 11 genes, operon hyf has 12 genes, and operon nuo has 13 genes. The y-axis is the running time of the method in seconds, on a log10 scale.

Figure 4.2 The x-axis is the same as in Figure 4.1. The y-axis is the event count for the orthologous gene block in each species compared to the reference gene block. In Figures 4.2a, 4.2b, and 4.2c, a dot represents the deletion events, the split events, and the sum of deletion and split events, respectively. Red dots represent the Heuristic method. Blue dots represent the Greedy method. Green dots represent the ILP method.

Figure 5.1 Ancestral reconstruction of operon atpIBEFHAGDC using the global optimization approach. The lower-case letters in each tree node represent the genes in the orthoblock (e.g. "a" represents "atpA", see legend in blue bar, top). A '|' designates a split (i.e. a distance ≥500bp between the genes to either side of the '|'). The green bar on top shows the total number of events that took place in this reconstruction. The numbers in the brackets in the inner nodes are a 3-tuple showing the cumulative count of events going from the leaf nodes to the tree root in the following order: [deletions, duplications, splits]. No orthologous gene blocks were found in species labeled with an asterisk (*). The reference genome E. coli is marked with a box. These naming and color conventions persist through this study.

Figure 5.2 Ancestral gene block reconstruction of operon paaABCDEFGHIJK using the global reconstruction approach.
Figure 5.3 Ancestral reconstruction of lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO.

Figure 5.4 Ancestral reconstruction of mmgABCDE-prpB.

Figure 6.1 Diversity among GA biosynthetic operons in divergent bacterial lineages. The core operon genes are defined as cyp112, cyp114, fd, sdr, cyp117, ggps, cps, and ks, as these are almost always present within the GA operon. Other genes, including cyp115, idi, and ids2, exhibit a more limited distribution among GA operon-containing species. The tandem diagonal lines in the Mesorhizobium loti MAFF303099 operon indicate that cyp115 is not located adjacent to the rest of the operon.

Figure 6.2 Pipeline of the method to reconstruct the ancestral state of the gibberellin operon.

Figure 6.3 Reduced ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method as a measure of the distance between species. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in Section 6.1.

Figure 6.4 Reduced ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction.