Graduate Theses and Dissertations
Iowa State University Capstones, Theses and Dissertations
2020
Evolution of gene blocks in bacteria using an event-based model
Huy Nguyen Iowa State University
Recommended Citation Nguyen, Huy, "Evolution of gene blocks in bacteria using an event-based model" (2020). Graduate Theses and Dissertations. 18368. https://lib.dr.iastate.edu/etd/18368
This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository.

Evolution of gene blocks in bacteria using an event-based model
by
Huy Nguyen
A dissertation submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Major: Computer Science
Program of Study Committee:
Oliver Eulenstein, Co-major Professor
Iddo Friedberg, Co-major Professor
David Fernandez-Baca
Dennis V. Lavrov
Xiaoqiu Huang
The student author, whose presentation of the scholarship herein was approved by the program of the study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.
Iowa State University
Ames, Iowa
2020
Copyright © Huy Nguyen, 2020. All rights reserved.
DEDICATION
To my parents, Danh Nguyen and Hoa Tran, who have encouraged, inspired, and enabled me to pursue my educational studies, and my family and friends for their support throughout my studies at home and abroad.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
ABSTRACT
CHAPTER 1. INTRODUCTION
CHAPTER 2. RELATED WORKS
    2.1 Gene Block Evolution Models
    2.2 Finding Orthologous Gene Blocks in Bacteria
    2.3 Ancestral Reconstruction of Gene Blocks in Bacteria
CHAPTER 3. PRELIMINARIES
    3.1 Event-based model
    3.2 Phylogenetic tree
CHAPTER 4. FINDING ORTHOLOGOUS GENE BLOCKS IN BACTERIA: THE COMPUTATIONAL HARDNESS OF THE PROBLEM AND NOVEL METHODS TO ADDRESS IT
    4.1 The Relevant-Gene-Block problem
    4.2 The RGB problem is Hard
    4.3 Solving the RGB problem
        4.3.1 Greedy Algorithm
        4.3.2 Integer Linear Programming
    4.4 Results and Discussion
        4.4.1 Scalability Study
        4.4.2 Accuracy Study
        4.4.3 Concluding Discussion
    4.5 Conclusion
CHAPTER 5. TRACING THE ANCESTRY OF OPERONS IN BACTERIA
    5.1 The Ancestral-Reconstruction-Gene-Block problem
    5.2 Solving the ARGB problems
        5.2.1 Local Algorithm
        5.2.2 Global Algorithm
    5.3 Results and Discussion
        5.3.1 Operons using Escherichia coli as a reference taxon
        5.3.2 Operons using Bacillus subtilis as a reference taxon
    5.4 Conclusions
CHAPTER 6. UNRAVELING A TANGLED SKEIN: EVOLUTIONARY ANALYSIS OF THE BACTERIAL GIBBERELLIN BIOSYNTHETIC OPERON
    6.1 Gene clusters analysis
    6.2 Results and Discussion
    6.3 Appendix: Supplementary Materials
CHAPTER 7. DISCUSSION AND FUTURE DIRECTIONS
    7.1 Finding Orthologous Gene Blocks in Bacteria: the Computational Hardness of the Problem and Novel Methods to Address It
        7.1.1 Discussions
        7.1.2 Future directions
    7.2 Tracing the Ancestry of Operons in Bacteria
        7.2.1 Discussions
        7.2.2 Future directions
    7.3 Unraveling a Tangled Skein: Evolutionary Analysis of the Bacterial Gibberellin Biosynthesis Operon
        7.3.1 Discussions
        7.3.2 Future directions
BIBLIOGRAPHY
LIST OF TABLES
Table 2.1 Pairwise events for the orthoblocks shown A-E

Table 6.1 Pairwise alignments of GGPS2 proteins and representative GGPS proteins. Protein sequences were aligned using the MUSCLE algorithm (Geneious Prime; default settings). The GGPS proteins included in this analysis were selected due to the relatively close phylogenetic relationship between the core operon of this strain and that of the ggps2-containing strains, which lack a full-length ggps gene (i.e. Rhizobium etli bv. mimosae str. Mim1 has the most similar core operon to the ggps2-containing Rhizobium strains and Bradyrhizobium sp. WSM1417 has the most similar core operon to ggps2-containing Bradyrhizobium strains). Sequences with greater than 90% amino acid identity are highlighted in dark blue, while those with greater than 80% but less than 90% amino acid identity are highlighted in light blue.

Table 6.2 List of strains used within the ancestral reconstruction analyses. Shown for each strain is the corresponding GenBank genome accession, the legume host from which it was isolated, and the nodule type (D = determinate, I = indeterminate, * = neither). Note that the legume host and nodule type are irrelevant for the gammaproteobacteria plant pathogens analyzed in these reconstructions.

Table 6.3 Strains containing a CYP114-FdGA gene fusion.

Table 6.4 Strains containing a FdGA-SDRGA gene fusion.
LIST OF FIGURES
Figure 2.1 Species C has an experimentally determined operon (black arrows) and serves as the reference taxon. The orthologs in species A, B, D, and E were determined as explained in the text. The events between C and all other species for this orthoblock are A–C: deletion (of gene c); B–C: split (of gene c); C–D: duplication (of b) and split (jagged line); C–E: duplication (of b).

Figure 2.2 Given a tree rooted at node x whose leaf nodes are labeled with nucleotides, the reconstruction problem finds a nucleotide assignment for each inner node. Figure 2.2b is an example of such an assignment.

Figure 4.1 The x-axis shows the operon/gene block names, sorted in ascending order of the number of genes in each block. From left to right: operons ast to ygb have 5 genes in the gene block, operons cai to ydh have 6 genes, operons cas to tdc have 7 genes, operons bam to ydg have 8 genes, operons atp to yje have 9 genes, operons paa to yrb have 11 genes, operon hyf has 12 genes, and operon nuo has 13 genes. The y-axis is the running time of the method in seconds on a log10 scale.

Figure 4.2 The x-axis is the same as in Figure 4.1. The y-axis is the event count for the orthologous gene block in each species compared to the reference gene block. In panels a, b, and c, a dot represents the deletion events, the split events, and the sum of deletion and split events, respectively. Red dots represent the Heuristic method. Blue dots represent the Greedy method. Green dots represent the ILP method.

Figure 5.1 Ancestral reconstruction of operon atpIBEFHAGDC using the global optimization approach. The lower-case letters in each tree node represent the genes in the orthoblock (e.g. “a” represents “atpA”, see legend in blue bar, top). A ‘|’ designates a split (i.e. a distance ≥500bp between the genes to either side of the ‘|’). The green bar on top shows the total number of events that took place in this reconstruction. The numbers in the brackets in the inner nodes are a 3-tuple showing the cumulative count of events going from the leaf nodes to the tree root in the following order: [deletions, duplications, splits]. No orthologous gene blocks were found in species labeled with an asterisk (∗). The reference genome E. coli is marked with a box. These naming and color conventions persist through this study.

Figure 5.2 Ancestral gene block reconstruction of operon paaABCDEFGHIJK using the global reconstruction approach.

Figure 5.3 Ancestral reconstruction of lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO.

Figure 5.4 Ancestral reconstruction of mmgABCDE-prpB.

Figure 6.1 Diversity among GA biosynthetic operons in divergent bacterial lineages. The core operon genes are defined as cyp112, cyp114, fd, sdr, cyp117, ggps, cps, and ks, as these are almost always present within the GA operon. Other genes, including cyp115, idi, and ids2, exhibit a more limited distribution among GA operon-containing species. The tandem diagonal lines in the Mesorhizobium loti MAFF303099 operon indicate that cyp115 is not located adjacent to the rest of the operon.

Figure 6.2 Pipeline of the method to reconstruct the ancestral state of the gibberellin operon.

Figure 6.3 Reduced ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method as a measure of the distance between species. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

Figure 6.4 Reduced ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

Figure 6.5 Summary of ancestral reconstruction for the GA biosynthetic operon. As a representation of GA operon evolution, the results from the full reconstruction generated using the concatenated operon (FO) are summarized here. The initial loss of the cyp115 gene is indicated with a red arrow, while the reacquisition of this gene is indicated with a green circle at the ancestral node. The loss of ggps and acquisition of ggps2 is indicated by a blue box at the ancestral node. Brackets around a gene represent a variable presence within that lineage. Double slanted lines indicate genes that are not located within the cluster (i.e. >500 bp away). The family of proteobacterial lineages is indicated to the right of the figure (α and γ labels).

Figure 6.6 Ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two γ-proteobacteria using the Neighbor-Joining method as a measure of the distance between species. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

Figure 6.7 Ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two β-proteobacteria using the Neighbor-Joining method. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.
ACKNOWLEDGMENTS
I would like to express my gratitude to my advisor, Dr. Oliver Eulenstein. He has been helpful in guiding me on this long journey. I would also like to thank my co-advisor, Dr. Iddo Friedberg, for his continuous support and guidance throughout my graduate program of study. My journey has been significantly impacted by Dr. Friedberg. Many thanks to my committee members, Drs. David Fernandez-Baca, Dennis Lavrov, and Xiaoqiu Huang, for their great suggestions toward my scientific research. I am especially grateful to my brother, Duy Nguyen, for his encouragement and support throughout my program of study. Last but not least, I sincerely appreciate my girlfriend, Nhi Van, who has encouraged me and listened to me when I struggled with difficulties and rejoiced with me at my successes.
ABSTRACT
A complex system is a group or organization in which many components interact to perform functions that a single component cannot. The study of the evolution of complex systems can reveal the pattern of interactions, the dependencies between entities, the impacts on the system’s environment, etc. In bacterial genomes, a gene block can be thought of as a complex system.
Gene blocks are genes that are co-located on chromosomes, whose evolutionary conservation is essential and useful in phylogenetic and functional studies. Conservation of gene blocks across several species indicates that some of the gene blocks are operons, which are gene blocks that are transcribed together into an mRNA under the control of a single promoter. Previously, an event-based model was introduced to study the evolution of gene blocks in bacteria. It was shown to be practical and efficient in examining the formation of gene blocks. In this dissertation research, I present three studies on the evolution of gene blocks in bacteria using the event-based model.
In the first study, we introduced the formalization of the event-based model and used it to study the problem of finding orthologous gene blocks in bacteria. Unfortunately, the problem is NP-hard, which means it is unlikely that an efficient algorithm exists to solve it exactly. Additionally, it is hard to approximate the problem within a ratio of log2(n)/2 ≈ 0.72 ln n, where n is the number of genes in the reference gene block. Fortunately, it is possible to approximate it within a ratio of ln n using a variant of the greedy algorithm for the Set Cover problem. To evaluate the optimality of this method, we formulated the problem as an Integer Linear Program. Using a data set from my previous paper, we compared my method with the state-of-the-art method and showed that mine was more efficient and closer to optimal.
In the second study, we explored the evolution of homologous gene blocks under the event-based model. Using the maximum parsimony approach, we developed two heuristic algorithms to reconstruct the ancestral state of gene blocks of a group of taxa given a reference gene block. We applied these two methods to study the evolution of 55 operons in Escherichia coli and 83 operons in Bacillus subtilis. From the results, we discovered several plausible intermediate states of gene blocks which might be fully or partially functional. From the reconstruction of the ancestral states of gene blocks, we formed hypotheses on how the gene blocks evolved and what external forces could have influenced the course of their evolution.
In the third study, we extended the heuristic algorithm from the second study to trace the evolution of the bacterial gibberellin biosynthetic operon. Gibberellin (GA) hormones are ubiquitous regulators of growth and developmental processes in vascular plants. Although homologous operons encode GA biosynthetic enzymes in both rhizobia and phytopathogens, there is a clear functional distinction between them. Using the reconstruction algorithm, we characterized losses, gains, and duplications of the gibberellin operon genes within alphaproteobacteria rhizobia and revealed a complex history of horizontal gene transfer of individual genes and clusters of genes.
In conclusion, the event-based model provides a powerful tool to study the evolution of gene blocks in bacteria. Although the problems are NP-hard, heuristic approaches can provide meaningful results within a reasonable time. In the future work section, I present directions to extend our studies, as well as problems they might solve.
CHAPTER 1. INTRODUCTION
Bacterial genes that encode proteins in a common pathway may cluster together. Those genes are sometimes found in the order of their biochemical reactions (Demerec and Hartman, 1959). Sometimes, blocks of genes (or gene blocks) are co-transcribed as an operon, sharing the same promoter and regulator, and are, in this case, often associated with a single function. It is estimated that 55% of the genes of E. coli reside within an operon (Wolf et al., 2001). Operons help cells to express or repress a group of genes efficiently. An essential question in evolutionary studies is how bacterial gene blocks and operons evolve. Several models were proposed to explain the formation of these structures, such as the natal model (Horowitz, 1945), the selfish operon model (Lawrence and Roth, 1996), the persistent model (Fang et al., 2008), etc. However, each of these models supported only a specific line of biological reasoning and could not be readily applied to every chosen gene block and set of taxa. Previously, we proposed the event-based model as a universal model to explain gene block and operon evolution. This model enables us to compute the evolutionary changes among gene blocks from different taxa (Ream et al., 2015) and quantify the conservation of different gene blocks in bacteria.
In Chapter 4, we define the mathematical problem of finding orthologous gene blocks in bacteria using an event-based model. We show that the problem is hard by a reduction from the Minimum-Set-Cover problem. In addition, we prove that it is hard to approximate within a constant factor of the optimum. Later, we introduce a heuristic approach that achieves a ln n approximation ratio and runs in polynomial time. To evaluate this approach and the approach in (Ream et al., 2015), we define an Integer Linear Programming version of the problem and solve it exactly on small datasets. Then, we compare the accuracy and efficiency of the methods. Overall, our approach runs about 100 times faster than the previous approach and 10,000 times faster than the Integer Linear Programming approach. The results of this method are closer to optimal than those of the previous method. The software accompanying our method is available under the GPLv3 license on GitHub: Relevant Operon
In Chapter 5, we apply the idea of ancestral sequence reconstruction to trace the ancestral gene block in bacteria using the event-based model. We define a cost function for the gene block assignment of a given tree. We propose a global and a local approach that minimize the given cost function. Using these approaches, we reconstruct the ancestral states of gene blocks from E. coli and B. subtilis for a given group of species. From the results, we can identify the intermediate functional forms of the gene blocks and study the pattern of conservation. We conclude that essentiality is one of the main reasons for gene block conservation. The software accompanying our method is available under the GPLv3 license on GitHub: ROAGUE
The plant hormones gibberellins (GAs) are essential to plant growth and development. They regulate stem elongation, seed germination, flowering, etc. (Yamaguchi, 2008). Research on gibberellins started in 1926, when a Japanese plant pathologist discovered that the fungus Gibberella fujikuroi produced a substance that affected rice stems. It was not until the 1950s that more researchers shifted their focus to gibberellins in plants. Recent studies have revealed that among the more than 100 GAs identified, only 4 are bioactive: GA1, GA3, GA4, and GA7. Hence, the other GAs exist as precursors of the bioactive forms or as deactivated metabolites (Yamaguchi, 2008). In the 20th century, the plant and fungal GA biosynthesis pathways were fully characterized. However, GA biosynthesis in bacteria was characterized only recently. Through the characterization of the putative GA gene cluster in Bradyrhizobium japonicum, the bacterial GA biosynthesis pathway was discovered to encode the enzymes that produce GA9 (Nett et al., 2017b). Analysis of this pathway reveals that GA biosynthesis in bacteria evolved independently of that in plants and fungi; however, some essential functionalities of the GA operon are still conserved in bacteria through the encoding genes.
In Chapter 6, we uncover reasons why bacteria gained the GA operon and examine its role in plant-microbe interactions. We apply my ancestral reconstruction method to study the evolution of the gibberellin operon in a group of 120 alphaproteobacteria rhizobia. From the analysis, we trace the loss and gain of several important genes of the operon and find horizontal gene transfer of whole clusters among the species.
CHAPTER 2. RELATED WORKS
2.1 Gene Block Evolution Models
The earliest study of pathway evolution suggested that the first living entity was a completely heterotrophic unit, which meant that it could reproduce only by using the metabolites in its environment (Horowitz, 1945). As the nutrients in the primitive environment became limited, mutations arose that enabled the entity to utilize other available substances. This process eventually led to the arrangement of genes in blocks. However, later studies showed that many bacterial operons are composed of genes that lack homology to one another within the operon (Lawrence and Roth, 1996). Furthermore, the proteins metB and metC are encoded by unlinked genes despite being highly homologous to each other (similar sequences) and catalyzing successive steps in methionine biosynthesis (Belfaiza et al., 1986). In addition, the natal model implied that gene positions are not important and that clustering brings no direct benefit to the host. This was later challenged by the persistent model (Fang et al., 2008).
Another hypothesis, called the co-adaptation model, applied Fisher’s theory of gene clusters to explain the increased linkage among co-adapted genes (genes that function together stay together in the genome). Fisher’s theory explains the natural selection process in eukaryotes that favors clustering of genes related to each other because clustering reduces the frequency of recombination events between them (e.g. crossing-over during meiosis). This theory was later extended to prokaryotes. Experiments on coliphages (double-stranded DNA bacteriophages) demonstrated that gene clustering in microorganisms was the result of natural selection against recombination (Stahl and Murray, 1966). However, the Fisher model requires sufficient recombination to select for gene clusters. Such a condition is typical for eukaryotes, but much less so for bacteria, which reproduce asexually.
The co-regulation model suggested that the clustering of genes was facilitated by co-regulation. If the genes cluster, a single operator can induce and repress them simultaneously (Jacob and Monod, 1961; Pardee et al., 1959). Co-regulation implies an enhancement in fitness for individuals that have gene clusters because the genes can be expressed more economically. However, some genes are co-regulated but not clustered, such as the arg and pyr genes in E. coli (Beckwith et al., 1962). Hence, co-regulation does not imply clustering.
The selfish operon model proposed that the formation of gene clusters in bacteria was the result of the transfer of DNA within and among taxa (Lawrence and Roth, 1996). This model was quite influential and also controversial. The physical proximity of genes improves the probability of successful horizontal transfer, which can replace genes that a given cell has lost through mutation or genetic drift. This is the sense in which the cluster is "selfish": clustering benefits the transfer of the genes between species but not the host. This hypothesis was supported in bacteria, where horizontal gene transfer is mediated by conjugative plasmids, transducing bacteriophages, etc. (Syvanen, 1994). However, clusters that tend to lose genes usually do not code for essential functions. Therefore, this model applied well to genes encoding accessory functions, and essential genes were predicted not to be clustered. However, the rpl and atp operons, which encode ribosomal proteins and the F1Fo-ATPase, respectively, are both arranged closely in blocks. In addition, some genes are clustered in some species but not in others. For example, the pur genes are grouped in B. subtilis (11 genes) but are scattered in S. typhimurium. Those cases are inconsistent with the selfish operon model. Furthermore, this model could not explain why the arrangement of genes in some clusters follows the order of the biochemical reactions.
The persistent model explained the persistence of gene blocks once they were formed. Under this model, clustering genes were grouped into persistent genes and rare genes. The model focused only on the clustering of highly persistent genes in bacteria (Fang et al., 2008). The model defined the Persistent Index (PI) of a gene as the percentage of bacteria containing that gene, and used it to find an association between the genes that tend to cluster and their PI. Interestingly, genes with PI ≥ 65% clustered together significantly. In addition, the model suggested that co-transcription and functional coupling helped to maintain the linkage between those persistent genes. Hence, the cluster was less likely to be disrupted by indel events and could resist mutation more easily.
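The Persistent Index defined above can be illustrated with a minimal Python sketch. It assumes each genome is summarized as a set of gene names; the function names, the toy genomes, and the default threshold argument are hypothetical and are not taken from Fang et al. (2008).

```python
def persistence_index(gene, genomes):
    """Percentage of genomes whose gene set contains `gene`."""
    if not genomes:
        raise ValueError("need at least one genome")
    present = sum(1 for genes in genomes.values() if gene in genes)
    return 100.0 * present / len(genomes)


def persistent_genes(genomes, threshold=65.0):
    """Genes whose PI is at least `threshold` percent (65% as in the text)."""
    all_genes = set().union(*genomes.values())
    return {g for g in all_genes if persistence_index(g, genomes) >= threshold}


# Toy data (illustrative only): four taxa and the genes each one carries.
genomes = {
    "taxon1": {"atpA", "atpB", "lacZ"},
    "taxon2": {"atpA", "atpB"},
    "taxon3": {"atpA", "lacZ"},
    "taxon4": {"atpA", "atpB"},
}

pi_atpB = persistence_index("atpB", genomes)   # present in 3 of 4 taxa
persistent = persistent_genes(genomes)
```

In this toy data set, lacZ occurs in only half of the taxa and therefore falls below the 65% threshold.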
All of these models proposed some explanations for the evolution of gene blocks in bacteria. Each of them had a specific hypothesis that worked exceptionally well in several cases, but not so well in others. Several studies suggested that there might be a model that involved multiple mechanisms. For example, a model that combined several models was able to lead to gene clustering under appropriate conditions (Ballouz et al., 2010). However, these models did not allow us to quantify the change between gene blocks. Hence, it is necessary to have a standard method to measure the evolution of gene blocks. The event-based model was proposed to allow us to quickly examine gene block evolution in a given group of taxa and to compare the conservation of gene blocks in bacteria.
The event-based model explains the evolution of gene blocks that are homologous between different bacteria. There are several essential concepts in this model. The reference taxon is the species in which the operons have been experimentally verified. Neighboring genes are genes in a genome that are at most 500 nucleotides away from each other. Gene blocks are blocks of genes that contain at least one pair of neighboring genes. Orthoblocks are gene blocks that contain at least one pair of neighboring genes that are homologous to genes in a gene block in the reference taxon’s genome. An event is a change in the gene blocks between any two species with homologous gene blocks. In this model, evolution is represented using 3 types of events: deletion, duplication, and split. Given orthoblocks from different taxa, their pairwise events are defined as:
1. Splits: If two genes in one taxon are neighboring and their homologs in the other taxon are not, then there is a single split event.

2. Deletions: A gene exists in the gene block of one taxon, but its homolog cannot be found in the other taxon.

3. Duplications: If there is a gene g in a gene block in the source genome, and homologous genes g′, g″ together in a block of the target genome, then there is a duplication event. The duplication has to occur in a gene block to be tallied.
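The neighboring-gene and gene-block definitions above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: it assumes genes are given as non-nested intervals on one linear coordinate axis, it measures the gap from the end of one gene to the start of the next, and the `Gene` type, the function name, and the coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int


MAX_GAP = 500  # neighboring genes are at most 500 nucleotides apart


def gene_blocks(genes, max_gap=MAX_GAP):
    """Group genes into blocks; a block needs at least one neighboring pair."""
    ordered = sorted(genes, key=lambda g: g.start)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start - prev.end <= max_gap:
            groups[-1].append(cur)   # cur neighbors prev: same block
        else:
            groups.append([cur])     # gap too large: start a new group
    # Lone genes have no neighbor, so they do not form a gene block.
    return [[g.name for g in grp] for grp in groups if len(grp) >= 2]


# Toy coordinates (illustrative only).
example = [
    Gene("a", 0, 900),
    Gene("b", 1000, 1900),    # 100 nt after a: neighbors
    Gene("c", 2600, 3500),    # 700 nt after b: not neighbors
    Gene("d", 3600, 4000),    # 100 nt after c: neighbors
    Gene("e", 10000, 10900),  # isolated gene
]
blocks = gene_blocks(example)
```

Here the 700-nucleotide gap between b and c separates the genes into two blocks, which is the situation that the split event records when the homologs of b and c are neighbors in the other taxon.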
An example that describes gene block evolution using the event-based model appears in Figure 2.1, where the orthoblocks from species A–E are arranged in a species phylogenetic tree. The full list of pairwise events between all species is provided in Table 2.1.
[Figure 2.1: species tree with the orthoblocks of species A–E. Gene content as drawn: A: a b; B: a b c; C: a b c; D: a b b' c; E: a b b' c.]
Figure 2.1: Species C has an experimentally determined operon (Black arrows), and serves as the reference taxon. The orthologs in species A, B, D, and E were determined as explained in the text. The events between C and all other species for this orthoblock are A–C: deletion (of gene c); B–C: split (of gene c), C–D: duplication (of b) and split (jagged line); C–E: duplication (of b).
Table 2.1: Pairwise events for the orthoblocks of species A–E shown in Figure 2.1

      A                               B                       C                    D        E
A
B     split, deletion
C     deletion                        split
D     duplication, deletion, split    duplication, 2× split   duplication, split
E     duplication, deletion           duplication, split      duplication          split

For the split, duplication, and deletion events, the distance between any two orthologous gene blocks found in target organisms is defined as:

1. Split distance (ds): the absolute difference in the number of relevant gene blocks between the two taxa. For example, consider the reference gene block with genes (abcdefg). Genome A has blocks ((ac),(bdefg)), and genome B has ((ac),(bde),(fg)). Therefore, ds(A, B) = |2 − 3| = 1.

2. Deletion distance (dd): the size of the symmetric difference between the sets of orthologous genes found in the two organisms.

3. Duplication distance (du): the difference in the number of duplicated orthologous genes between the two orthologous gene blocks, counted only over orthologous genes that are found in both organisms.
Using the event distance, we can compute the all pair-wise distance and normalize it for every pair of orthologous gene blocks from a group of species for each event. This enables us to compare the relative conservation of gene blocks in bacteria.
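The three event distances can be sketched in Python under stated assumptions. Each taxon is represented as a list of relevant gene blocks, and each block as a list of gene labels named after their homologs in the reference block. The split and deletion distances follow the definitions directly. The duplication distance is only one possible reading of the definition (the per-gene difference in copy number, summed over genes present in both taxa), and all function names are hypothetical. Normalization is not shown.

```python
from collections import Counter


def gene_counts(blocks):
    """Copy number of every gene label across all blocks of one taxon."""
    return Counter(gene for block in blocks for gene in block)


def split_distance(blocks_a, blocks_b):
    """Absolute difference in the number of relevant gene blocks."""
    return abs(len(blocks_a) - len(blocks_b))


def deletion_distance(blocks_a, blocks_b):
    """Number of genes found in exactly one of the two taxa."""
    return len(set(gene_counts(blocks_a)) ^ set(gene_counts(blocks_b)))


def duplication_distance(blocks_a, blocks_b):
    """Assumed reading: copy-number difference over genes shared by both taxa."""
    counts_a, counts_b = gene_counts(blocks_a), gene_counts(blocks_b)
    shared = set(counts_a) & set(counts_b)
    return sum(abs(counts_a[g] - counts_b[g]) for g in shared)


# The split example from the text, reference block (abcdefg).
genome_a = [list("ac"), list("bdefg")]
genome_b = [list("ac"), list("bde"), list("fg")]
ds = split_distance(genome_a, genome_b)

# A hypothetical pair: one extra copy of b and one extra gene c in the second taxon.
taxon_x = [["a", "b"]]
taxon_y = [["a", "b", "b", "c"]]
dd = deletion_distance(taxon_x, taxon_y)
du = duplication_distance(taxon_x, taxon_y)
```

The first pair reproduces ds(A, B) = 1 from the worked example, and the second pair has one deletion and one duplication.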
2.2 Finding Orthologous Gene Blocks in Bacteria
In biology, the problem of finding conserved gene clusters has been an ongoing challenge. Since conserved gene clusters tend to share common functionality, detection of conserved gene clusters can help identify new operon functionality. Several methods and approaches were proposed to define and search for orthologous gene blocks. One approach identifies conserved gene clusters using the concept of a “run”, which is a set of genes that occur on the same strand such that the gap between any pair of consecutive genes is at most 300 bp (Overbeek et al., 1999). This notion was a first step in connecting the biological knowledge with a mathematical formalization. However, the 300 bp distance is a rather strict constraint for finding gene clusters. Later on, a graph-based algorithm was proposed to identify FRECs (functionally related enzyme clusters) (Ogata et al., 2000). The main idea is to compare the order of genes in the genome with the cluster of gene products in the pathway. By using a single linkage clustering algorithm, the approach can extract a set of enzymes that are encoded closely in the genome and catalyze successive steps in the metabolic pathway. This method exploits the local similarities between the metabolic pathway and the genome sequence. Although the method can infer conservation of gene clusters, it cannot detect conserved gene clusters that do not encode successive reactions. Another popular approach is the gene team model (Luc et al., 2003). In this model, a team of genes of chromosomes A and B is a set of genes that appear in both A and B such that the gap between each pair of consecutive genes is at most a threshold θ. There are efficient algorithms to find gene teams in an arbitrary number of chromosomes for any threshold θ. This approach was later extended to handle spatial data to detect spatial gene clusters (Schulz et al., 2018). One drawback of this model is the strict gap constraint between consecutive genes. A second drawback is that the gene team model does not consider gene duplication, since a gene team is a set. The event-based model, on the other hand, includes duplication events because orthoblocks are multisets of genes, which is elaborated in Chapter 4. In a sense, the event-based model generalizes the gene team model in terms of the gap constraint and duplication events.
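The gene team definition above can be illustrated with a simple Python sketch for two chromosomes. It uses the straightforward recursive splitting strategy, not the efficient algorithms referred to in the text, and it assumes that every gene occurs at most once per chromosome and is located by a single coordinate. The function names and the toy positions are hypothetical. In this sketch a single gene also counts as a (trivial) team.

```python
def split_at_gaps(genes, pos, theta):
    """Split `genes` wherever consecutive positions in `pos` differ by more than theta."""
    ordered = sorted(genes, key=lambda g: pos[g])
    parts = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] > theta:
            parts.append([cur])
        else:
            parts[-1].append(cur)
    return [set(p) for p in parts]


def gene_teams(pos_a, pos_b, theta):
    """Maximal gene sets whose consecutive gaps are at most theta in both A and B."""
    teams = []
    pending = [set(pos_a) & set(pos_b)]  # only genes present in both chromosomes
    while pending:
        genes = pending.pop()
        if not genes:
            continue
        parts = split_at_gaps(genes, pos_a, theta)
        if len(parts) == 1:
            parts = split_at_gaps(genes, pos_b, theta)
        if len(parts) == 1:
            teams.append(frozenset(genes))  # no large gap in A or in B
        else:
            pending.extend(parts)           # re-check each part on both chromosomes
    return set(teams)


# Toy positions (illustrative only); gene f is missing from chromosome B.
pos_a = {"a": 0, "b": 100, "c": 200, "d": 1000, "e": 1100, "f": 1200}
pos_b = {"a": 0, "c": 50, "b": 900, "d": 950, "e": 1000}
teams = gene_teams(pos_a, pos_b, theta=300)
```

With θ = 300, gene b is close to a and c on chromosome A but far from them on chromosome B, so it ends up in a team of its own, while {a, c} and {d, e} remain teams. Because a team is a set, a second copy of a gene cannot be represented, which is the limitation noted above.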
Using the event-based model, we can identify orthologous gene blocks for a group of species.
Previously, a heuristic approach was proposed to recover the maximum number of genes (minimize the deletion distance), prioritize gene blocks (minimize the split distance), and minimize the duplication distance for two given orthologous gene blocks (Ream et al., 2015). However, that work did not formalize the problem. This encouraged us to define the problem mathematically and study its complexity in Chapter 4.
2.3 Ancestral Reconstruction of Gene Blocks in Bacteria
Another use of the event-based model is the reconstruction of ancestral gene blocks along the evolutionary tree. Ancestral reconstruction is the tracing of ancestral features given the features of present-day individuals or populations. By recovering the state of the ancestral population, one can hypothesize the benefits of gaining or losing a feature, infer the environmental pressure that causes adaptations, predict future tendencies, etc. In biology, the reconstruction of ancestral sequences, including the amino acid sequences of proteins, has been studied extensively. The first formal study of ancestral reconstruction can be traced to (Pauling et al., 1963). Pauling and Zuckerkandl determined the common ancestral amino acid sequence of a group of homologous polypeptide chains using probability. In addition, the difference in amino acids at the same molecular site of two polypeptide chains could reveal in which lineage the mutation occurred. This study led to several later ancestral studies, such as the study of evolutionary divergence and convergence in proteins (Zuckerkandl and Pauling, 1965); the study of ancestral reconstruction models, namely maximum parsimony (Swofford and Maddison, 1987), maximum likelihood (Yang et al., 1995), and Bayesian inference (Yang, 2007); the ancestral states of the structure of folded proteins (Harms and Thornton, 2010); and the evaluation of ancestral protein reconstruction methods (Williams et al., 2006).
The maximum parsimony principle is the idea that one should choose the hypothesis with the fewest assumptions to make a prediction. In ancestral reconstruction, parsimony chooses the ancestral states within a given tree so that they minimize the total number of evolutionary changes needed to explain the observed population at the leaf nodes. Maximum parsimony is one of the earliest formally studied approaches to ancestral reconstruction. In character-based parsimony, each taxon is described by a set of characters. Each character can be in one of a finite number of states, and only specific changes between states are allowed in one step. There are several variants of the parsimony model. Fitch parsimony (standard parsimony) allows change between any two character states, and any change counts as one step. Wagner parsimony only allows changes that happen in a particular order. Dollo parsimony assumes that a character is gained only once and can never be regained if it is lost (Farris, 1977). Overall, parsimony approaches are easier to implement and more efficient than maximum likelihood approaches. However, in exchange for this efficiency, parsimony does not account for variation in rates of evolution. In addition, by favoring the minimum number of evolutionary changes, parsimony may reconstruct an unrealistic course of evolution when changes happen often. Most importantly, it does not have a statistical model to estimate uncertainties.
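As an illustration of Fitch parsimony, the following is a minimal Python sketch of the bottom-up pass of Fitch's algorithm on a rooted binary tree. The nested-tuple tree encoding, the function name `fitch_sets`, and the example leaf states are assumptions made for this sketch and are not taken from the thesis.

```python
def fitch_sets(tree, leaf_state):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_state: dict mapping each leaf name to its observed character state.
    Returns (candidate states at the root of `tree`, minimum number of changes).
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left_set, left_cost = fitch_sets(tree[0], leaf_state)
    right_set, right_cost = fitch_sets(tree[1], leaf_state)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical example: a four-leaf tree ((m, n), (o, p)).
example_tree = (("m", "n"), ("o", "p"))
example_leaves = {"m": "A", "n": "A", "o": "T", "p": "G"}
root_states, changes = fitch_sets(example_tree, example_leaves)
```

Only the cost and the candidate state sets are computed here; a second, top-down pass would be needed to pick one concrete state per inner node.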
The goal of maximum likelihood is to find the best estimate of the unknown parameter values based on the observations. This statistical method allows us to measure the confidence of the ancestral reconstructions. In the simplest maximum likelihood model, all characters have independent state transitions, and their rate of change is constant over time. The model can be extended to allow variation in mutation rates by adding more parameters. Given the data at the children nodes, the maximum likelihood model computes the conditional probabilities of different ancestral reconstructions and chooses the reconstruction that has the highest conditional probability (Yang et al., 1995). This can be best explained using Figure 2.2. Given a phylogenetic tree rooted at node x, the observed data at the leaf nodes X = {m : A, n : A, o : T, p : G}, and parameters θ containing the mutation rates and the branch lengths, we want to compute the likelihood of the observed data. Each inner node can be labeled with one of 4 nucleotides (A, G, T, C). Therefore, there are $4^3 = 64$ possible assignments in total. For a possible assignment such as the one in Figure 2.2b, with x : A, y : A, z : T, we can compute its conditional probability as:

$$
\begin{aligned}
&P(x{:}A) \times P(y{:}A \mid x{:}A, t_{xy}) \times P(z{:}T \mid x{:}A, t_{xz}) \times P(m{:}A \mid y{:}A, t_{ym}) \\
&\quad \times P(n{:}A \mid y{:}A, t_{yn}) \times P(o{:}T \mid z{:}T, t_{zo}) \times P(p{:}G \mid z{:}T, t_{zp})
\end{aligned}
\tag{2.1}
$$
To calculate the likelihood of the observed data, we have to sum over all possible assignments. The objective of ancestral reconstruction under the maximum likelihood model is to find the assignment to all internal nodes that maximizes the likelihood of the observed data for a given tree. One approach is to greedily maximize the likelihood between an ancestral node and its immediate descendants as we traverse the tree from leaf to root. This simplifies the computation; however, the resulting likelihood might be worse than the global optimum. The second approach is to consider the joint assignment, as in Equation 2.1, to find the optimal solution. The latter approach is clearly more time-consuming. Fortunately, more efficient algorithms that tackle the joint likelihood have been developed (Pupko et al., 2000). Overall, maximum likelihood methods have greater accuracy than maximum parsimony methods when variation in rates of evolution is the norm. However, maximum likelihood requires a transition matrix between the character states, which is not readily available for the event-based model.
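The enumeration over the 64 inner-node assignments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thesis does not name a substitution model, so the sketch assumes the Jukes-Cantor model with a uniform root prior and hypothetical branch lengths of 0.1; the names `jc69`, `assignment_probability`, `likelihood`, and `best_joint_assignment` are invented for the example.

```python
import itertools
import math

NUCLEOTIDES = "ACGT"


def jc69(parent, child, t):
    """Assumed Jukes-Cantor probability of `child` given `parent` after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if parent == child else 0.25 - 0.25 * decay


def assignment_probability(x, y, z, leaves, t):
    """Product in the form of Equation 2.1 for one labelling (x, y, z) of the inner nodes."""
    m, n, o, p = leaves
    return (0.25  # assumed uniform prior P(x)
            * jc69(x, y, t["xy"]) * jc69(x, z, t["xz"])
            * jc69(y, m, t["ym"]) * jc69(y, n, t["yn"])
            * jc69(z, o, t["zo"]) * jc69(z, p, t["zp"]))


def likelihood(leaves, t):
    """Likelihood of the leaf data: sum over all 4^3 = 64 inner-node assignments."""
    return sum(assignment_probability(x, y, z, leaves, t)
               for x, y, z in itertools.product(NUCLEOTIDES, repeat=3))


def best_joint_assignment(leaves, t):
    """Joint reconstruction: the inner-node labelling with the largest product."""
    return max(itertools.product(NUCLEOTIDES, repeat=3),
               key=lambda xyz: assignment_probability(*xyz, leaves, t))


# Hypothetical branch lengths; the leaf data follow Figure 2.2 (m:A, n:A, o:T, p:G).
branch = {k: 0.1 for k in ("xy", "xz", "ym", "yn", "zo", "zp")}
observed = ("A", "A", "T", "G")
total = likelihood(observed, branch)
best = best_joint_assignment(observed, branch)
```

With equal branch lengths several joint assignments can tie, so `best` is one maximizer among possibly many. The exhaustive enumeration is exponential in the number of inner nodes, which is why dynamic-programming approaches are used in practice.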
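[Figure 2.2, panel (a): a tree rooted at x with inner nodes y and z, branch lengths $t_{xy}$, $t_{xz}$, $t_{ym}$, $t_{yn}$, $t_{zo}$, $t_{zp}$, leaves m : A, n : A, o : T, p : G, and unlabeled inner nodes (x : ?, y : ?, z : ?). Panel (b): the same tree with the assignment x : A, y : A, z : T.]

Figure 2.2: Given a tree rooted at node x whose leaf nodes are labeled with nucleotides, the reconstruction problem finds a nucleotide assignment for each inner node. Figure 2.2b is an example of such an assignment.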
Because of the simplicity of the maximum parsimony principle, we use it to explore the evolution of gene blocks in bacteria; however, the problems that we define under maximum parsimony assumptions are already hard. One problem is to minimize the evolution cost between every pair of children nodes and their parent node, and the other is to minimize the sum of all evolution costs over the phylogenetic tree. We then apply these methods to several operons from two reference species, E. coli and B. subtilis. The results provide insights into the evolution of gene blocks and help us understand the functional intermediate states leading to the full operon.
CHAPTER 3. PRELIMINARIES
First, we introduce some important definitions of orthologous gene blocks. These help formalize the evolutionary cost using the event-based model. Second, we present some typical phylogenetic tree definitions to define our labeling function, and some notions to help with the ancestral reconstruction algorithms that will be presented later in the thesis.
3.1 Event-based model
A reference taxon is a taxon where operons have been reliably identified by experimental means.
Such a taxon serves as a standard of truth to determine if the genes on a suspected orthoblock reside in an operon or a similar co-regulated gene block at least in one species.
An event is a single change in an orthoblock that is characterized as a split, deletion, or duplication. Fig. 2.1 depicts an example of such events. The event-based cost between any two orthoblocks is then defined to be the minimum possible number of splits, duplications, and deletions required to explain the difference between them.
Given a reference operon $O$, we define $\mathcal{G} := \{x_1, x_2, x_3, \ldots, x_n\}$ to be the set of genes of $O$. A gene block $B$ over $\mathcal{G}$ is a non-empty multiset of $\mathcal{G}$ defined as $B := \{x_1^{\lambda_1}, x_2^{\lambda_2}, \ldots, x_n^{\lambda_n}\}$ where $x_i \in \mathcal{G}$ and $\lambda_i \in \mathbb{N}$. We define the set of genes in gene block $B$ as $\mathrm{Gene}(B) := \{x_i \mid \lambda_i \geq 1\}$, the duplication gene set of a gene block $B$ as $\mathrm{Dup}(B) := \{x_i \mid \lambda_i \geq 2\}$, and the size of a gene block $B$ as $\mathrm{Size}(B) := \sum_i \lambda_i$.

An orthoblock $O$ is a set of gene blocks that is either empty or contains at least one gene block of size larger than or equal to two. We define the set of genes of $O$ as $\mathrm{Gene}(O) := \bigcup_{B \in O} \mathrm{Gene}(B)$, and the set of genes that are duplicated in some gene block of $O$ as $\mathrm{Dup}(O) := \bigcup_{B \in O} \mathrm{Dup}(B)$. Given a gene block $B$ and a gene set $G$ over $\mathcal{G}$, we define $B \cap G := \{x_i^{\lambda_i} \in B \mid x_i \in G\}$. Costs for orthoblocks are described as pairwise functions that calculate the event count from one orthoblock to another, as a proxy for their evolutionary distance. The costs between any two orthoblocks $O$ and $O'$ are defined as follows.
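1. The split cost, denoted as $c_s$, is the absolute difference in the number of relevant gene blocks between the two taxa involved. We define $\mathrm{Rel}(O, O')$ as the set of gene blocks from $O$ where each gene in each gene block has to appear in $O'$ at least once. Formally, $\mathrm{Rel}(O, O') := \bigcup_{B \in O} (B \cap \mathrm{Gene}(O'))$. The split cost can be formalized as:

$$
\begin{aligned}
c_s(O, O') &:= \big|\, |\mathrm{Rel}(O, O')| - |\mathrm{Rel}(O', O)| \,\big| && (3.1) \\
&= \Big|\, \Big|\bigcup_{B \in O} (B \cap \mathrm{Gene}(O'))\Big| - \Big|\bigcup_{B \in O'} (B \cap \mathrm{Gene}(O))\Big| \,\Big| && (3.2)
\end{aligned}
$$

If the orthoblock $O'$ is the reference gene block $R$, the split cost between the two orthoblocks $R$ and $O$ can be simplified as follows.

$$
\begin{aligned}
c_s(R, O) &= \big|\, |\mathrm{Rel}(R, O)| - |\mathrm{Rel}(O, R)| \,\big| && (3.3) \\
&= \Big|\, \Big|\bigcup_{B \in R} (B \cap \mathrm{Gene}(O))\Big| - \Big|\bigcup_{B \in O} (B \cap \mathrm{Gene}(R))\Big| \,\Big| && (3.4) \\
&= \Big|\, \Big|\bigcup_{B \in R} (B \cap \mathrm{Gene}(O))\Big| - \Big|\bigcup_{B \in O} (B \cap \mathrm{Gene}(O))\Big| \,\Big| && (3.5) \\
&= \Big|\, 1 - \Big|\bigcup_{B \in O} B\Big| \,\Big| = \big|1 - |O|\big| = |O| - 1 && (3.6)
\end{aligned}
$$

For example, for the reference gene block R = (abcdefg), genome A has blocks O = ((ab), (def)). We then compute the relevant gene blocks Rel(R, O) = (abdef) (removing genes c, g) and Rel(O, R) = ((ab), (def)). Therefore, $c_s(R, O) = |1 - 2| = 1$.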
2. The duplication cost, denoted as $c_u$, is the pairwise count of gene duplications between two orthoblocks. We define $\mathrm{Dif}(O, O')$ to be the set of duplicated genes of orthoblock $O$ such that these genes also appear in $O'$ but are not duplicated in $O'$. Formally, $\mathrm{Dif}(O, O') := (\mathrm{Dup}(O) \cap \mathrm{Gene}(O')) \setminus \mathrm{Dup}(O')$. Here, our gene blocks are guaranteed to have at most one duplication of each gene for each block. We formalize the duplication cost as follows.

$$
\begin{aligned}
c_u(O, O') &:= |\mathrm{Dif}(O, O')| + |\mathrm{Dif}(O', O)| && (3.7) \\
&= |(\mathrm{Dup}(O) \cap \mathrm{Gene}(O')) \setminus \mathrm{Dup}(O')| \\
&\quad + |(\mathrm{Dup}(O') \cap \mathrm{Gene}(O)) \setminus \mathrm{Dup}(O)| && (3.8)
\end{aligned}
$$

If the orthoblock $O'$ serves as the reference gene block $R$, we can simplify the duplication cost between orthoblocks $R$ and $O$ as follows.

$$
\begin{aligned}
c_u(R, O) &= |\mathrm{Dif}(R, O)| + |\mathrm{Dif}(O, R)| && (3.9) \\
&= |(\mathrm{Dup}(R) \cap \mathrm{Gene}(O)) \setminus \mathrm{Dup}(O)| \\
&\quad + |(\mathrm{Dup}(O) \cap \mathrm{Gene}(R)) \setminus \mathrm{Dup}(R)| && (3.10) \\
&= |\emptyset| + |\mathrm{Dup}(O)| = |\mathrm{Dup}(O)| && (3.11)
\end{aligned}
$$

For example, for a reference gene block R = (abcde), the genome A has the gene block O = (abbcc). Observe that the orthologs of genes b and c are duplicated in genome A. We then compute $\mathrm{Dif}(R, O) = \emptyset$ and $\mathrm{Dif}(O, R) = \{b, c\}$. Therefore, $c_u(R, O) = 0 + 2 = 2$.
3. The deletion cost, denoted as $c_d$, is the number of orthologs that are in the orthoblock of one organism's genome or the other's, but not in both. In other words, the deletion cost is the size of the symmetric difference between the sets of orthologous genes of the two orthoblocks $O, O'$. We formalize the deletion cost as:

$$c_d(O, O') := |\mathrm{Gene}(O) \,\triangle\, \mathrm{Gene}(O')|$$

If orthoblock $O'$ is the reference gene block $R$, we can simplify the deletion cost between orthoblocks $R$ and $O$:

$$
\begin{aligned}
c_d(R, O) &= |\mathrm{Gene}(R) \,\triangle\, \mathrm{Gene}(O)| && (3.12) \\
&= |\mathrm{Gene}(R)| - |\mathrm{Gene}(O)| \quad \text{since } \mathrm{Gene}(O) \subseteq \mathrm{Gene}(R) && (3.13)
\end{aligned}
$$

For example, for a reference gene block R = (abcde), the genome A has gene block O = (abd), and the deletion cost between orthoblocks R, O is $c_d(R, O) = |\{a, b, c, d, e\} \,\triangle\, \{a, b, d\}| = 5 - 3 = 2$.
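The three reference-based costs can be checked against the worked examples above with a short Python sketch. The representation is an assumption made for illustration: genes are single letters, a gene block is a string, an orthoblock is a list of strings, and the reference is assumed to contain no duplicated gene. The function names are hypothetical.

```python
from collections import Counter


def genes(orthoblock):
    """Gene(O): every gene that occurs in at least one block of the orthoblock."""
    return set().union(*orthoblock)


def duplicated(orthoblock):
    """Dup(O): genes that occur at least twice inside some block."""
    return {g for block in orthoblock
            for g, count in Counter(block).items() if count >= 2}


def split_cost(reference, orthoblock):
    """c_s(R, O) = |1 - |O|| as in Equation 3.6 (O is assumed non-empty)."""
    return abs(1 - len(orthoblock))


def duplication_cost(reference, orthoblock):
    """c_u(R, O) = |Dup(O)| as in Equation 3.11, restricted to reference genes."""
    return len(duplicated(orthoblock) & set(reference))


def deletion_cost(reference, orthoblock):
    """c_d(R, O) = |Gene(R) symmetric-difference Gene(O)| as in Equation 3.12."""
    return len(set(reference) ^ genes(orthoblock))


# The three worked examples from the text.
split_example = split_cost("abcdefg", ["ab", "def"])
duplication_example = duplication_cost("abcde", ["abbcc"])
deletion_example = deletion_cost("abcde", ["abd"])
```

The sketch implements only the simplified forms that compare an orthoblock with the reference; the general pairwise costs of Equations 3.1, 3.7, and the unsimplified deletion cost would need both orthoblocks as multisets.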
3.2 Phylogenetic tree
Let $T$ be a tree with node set $V(T)$, edge set $E(T)$, and leaf set $L(T)$, and let $G$ be the set of genes of a reference operon. We define $\Omega$ as the set of all possible orthoblocks over the gene set $G$. Let $\lambda : L(T) \to \Omega$ be the labeling of $L(T)$, which assigns an orthoblock from $\Omega$ (possibly the empty orthoblock) to each leaf node of $T$. We define the function $\hat{\lambda} : V(T) \to \Omega$ to be an extension of $\lambda$ on $T$ if it coincides with $\lambda$ on the leaves of $T$; that is, $\hat{\lambda}$ assigns an orthoblock to each node of $T$. If $\hat{\lambda}(v) = O$, we say that vertex $v$ is labeled with orthoblock $O$. Given a labeling $\hat{\lambda}$ and an edge $(u, v) \in E(T)$, we define the cost between the two labelings of the endpoints $u, v$ as $c(u, v) := c(\hat{\lambda}(u), \hat{\lambda}(v))$ and the total cost function as $c(\hat{\lambda}) := \sum_{(u,v) \in E(T)} c(u, v)$. Furthermore, given a species tree $T$ and a reference operon with gene set $G$, for a node $v \in V(T)$ labeled with orthoblock $O := \hat{\lambda}(v)$, we define:
1. $I_g(v)$: the indicator function of gene $g$ in $O$, which equals 1 if $g$ is in $O$ and 0 otherwise.
2. v.gene[g]: the set that represents whether to include gene g in O. There are only 3 possible cases.
   (a) v.gene[g] = {1}: gene g has to be in O.
   (b) v.gene[g] = {0}: gene g cannot be in O.
   (c) v.gene[g] = {0, 1}: gene g may or may not be in O.
3. v.dup[g]: the set that represents the duplication status of gene g in O. There are only 3 possible cases.
   (a) v.dup[g] = {1}: gene g has to be duplicated in O.
   (b) v.dup[g] = {0}: gene g cannot be duplicated in O.
   (c) v.dup[g] = {0, 1}: gene g may or may not be duplicated in O.
4. $\mathrm{FREQ}_g(v)$: the proportion of the leaves $L(T_v)$ that contain gene $g$, where $T_v$ is the subtree of $T$ rooted at $v$: $\mathrm{FREQ}_g(v) = \frac{|\{u \mid u \in L(T_v) \wedge I_g(u) = 1\}|}{|L(T_v)|}$.
5. $\mathrm{DUP}_g(v)$: the proportion of the leaves $L(T_v)$ that contain a duplication of gene $g$.
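The two proportions can be computed directly from a leaf labeling. The following Python sketch is an illustration under assumptions that are not in the thesis: a subtree is encoded as a leaf name or a tuple of child subtrees, a leaf label is a list of gene-block strings over single-letter gene names, and the names `leaves`, `freq`, and `dup` are hypothetical.

```python
from collections import Counter


def leaves(tree):
    """Leaf names of a subtree given as a leaf name (str) or a tuple of child subtrees."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def has_gene(orthoblock, g):
    """I_g for a leaf label: True if gene g occurs in some block."""
    return any(g in block for block in orthoblock)


def has_duplication(orthoblock, g):
    """True if gene g occurs at least twice inside some block."""
    return any(Counter(block)[g] >= 2 for block in orthoblock)


def freq(tree, labelling, g):
    """FREQ_g(v): fraction of the leaves below v whose label contains gene g."""
    names = leaves(tree)
    return sum(has_gene(labelling[u], g) for u in names) / len(names)


def dup(tree, labelling, g):
    """DUP_g(v): fraction of the leaves below v whose label duplicates gene g."""
    names = leaves(tree)
    return sum(has_duplication(labelling[u], g) for u in names) / len(names)


# Hypothetical three-taxon example; t3 carries the empty orthoblock.
species_tree = (("t1", "t2"), "t3")
leaf_labels = {"t1": ["ab", "cd"], "t2": ["abb"], "t3": []}
freq_a_at_root = freq(species_tree, leaf_labels, "a")
dup_b_at_root = dup(species_tree, leaf_labels, "b")
```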
CHAPTER 4. FINDING ORTHOLOGOUS GENE BLOCKS IN BACTERIA: THE COMPUTATIONAL HARDNESS OF THE PROBLEM AND NOVEL METHODS TO ADDRESS IT
In bacteria and archaea, gene blocks are sets of genes co-located on the chromosome, which are typically conserved, to some extent, between species. Operons can be viewed as a special case of gene blocks where genes are co-transcribed to polycistronic mRNA and are often associated with related functions, molecular complexes, or both. Such conserved gene blocks have been used in gene function prediction and phylogenetic analyses (Enault et al., 2003; Overbeek et al., 1999; Srinivasan et al., 2005). Several interesting biological questions could be answered by studying the evolution of gene blocks: Which components of the gene block tend to be more conserved? How did the gene block evolve? Given a gene block in a reference genome, which gene blocks from other taxa are orthologous to it?
Annotating sequences of bacterial genomes and determining whether gene blocks are orthologous is essential for our understanding of bacterial genomics and evolution. We recently developed a heuristic method to identify novel orthologous gene blocks (Ream et al., 2015). Motivated by this approach, we formalized the biological problem of finding orthologous gene blocks, which we named the Relevant Gene Block problem. In this study, we show that the Relevant Gene Block problem is NP-hard. We then describe a greedy approximation method and assess its accuracy. We also formulate an Integer Linear Program for an exact solution, which is suitable for conducting smaller studies and can serve as a baseline for evaluating the greedy method. We demonstrate the scalability and accuracy of our greedy method through extensive comparative empirical studies. From this point on, we refer to the method in (Ream et al., 2015) as the Heuristic method, our method as the Greedy method, and the Integer Linear Programming formulation as the ILP method.
In this chapter, we use the definitions in Chapter 3 to define the Relevant Gene Block (RGB) problem. Given a reference gene block and a set of homologous gene blocks in a target genome, this problem seeks orthologous gene blocks. We prove that this problem is inherently difficult, that is, NP-hard. Furthermore, we show that the RGB problem is unlikely to be efficiently approximated with a ratio bounded by a constant, meaning the problem is APX-hard. Despite these discouraging time complexity results, we describe an efficient greedy algorithm with an O(ln n) approximation for the RGB problem. In addition, we provide an ILP formulation of the RGB problem suitable for smaller gene-block studies. Finally, in a comparative study on empirical data, we demonstrate the performance of our proposed methods in terms of scalability and optimization.
4.1 The Relevant-Gene-Block problem
Under the assumption that the reference gene block is the true operon, our problem is to identify the orthoblocks in our target genome so that the overall cost for the three events is minimized.
Observe that this problem might be challenging since the event costs are not independent of each other: they all depend on the set of genes of the orthoblock. First, we describe a mathematical formalization of the problem, which we refer to as the Relevant-Gene-Block (deletion, duplication, split) problem. Then we formulate restricted variations of this problem relevant for our time complexity analyses, which involve a reduction from the minimum set-cover problem.
Problem 1. Relevant-Gene-Block (deletion, duplication, split)
Instance: $\langle R, \mathcal{O} \rangle$, which are the reference gene block and the set of homologous gene blocks in the target genome.
Find: $\langle O' \rangle$ where $O' \subseteq \mathcal{O}$, so that $c(R, O') := c_d(R, O') + c_s(R, O') + c_u(R, O')$ is minimum.

Problem 2. Relevant-Gene-Block (deletion, split)
Instance: $\langle R, \mathcal{O} \rangle$, which are the reference gene block and the cluster of homologous genes in the target genome.
Find: $\langle O' \rangle$ where $O' \subseteq \mathcal{O}$, so that $c(R, O') := c_d(R, O') + c_s(R, O') = |\mathrm{Gene}(R)| - |\mathrm{Gene}(O')| + |O'| - 1$ is minimum.
Since $|\mathrm{Gene}(R)|$ is the size of the gene set of $R$, and 1 is a constant, we can reduce this problem further into its final form suitable for our analysis.

Problem 3. Relevant-Gene-Block (deletion, split, simplified)
Instance: $\langle S, \mathcal{O} \rangle$, which are the set of genes of our reference block and a collection of subsets of the reference gene set, respectively.
Find: $\langle O' \rangle$ where $O' \subseteq \mathcal{O}$, so that $f(S, O') := |S| - |\cup O'| + |O'|$ is minimum.
We state the minimum set cover problem required for our reduction.
Problem 4. Minimum-Set-Cover
Instance: $\langle S, \mathcal{O} \rangle$, which are a set of ground elements (here, the set of genes of our reference block) and a collection of subsets of $S$, respectively.
Find: $\langle O' \rangle$ where $O' \subseteq \mathcal{O}$, so that $\cup O'$ covers $S$ and $|O'|$ is minimum.
From now on, we use MSC to stand for the Minimum-Set-Cover problem, and RGB for Relevant-
Gene-Block (deletion, split, simplified) problem.
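To make the simplified objective concrete, the sketch below evaluates $f(S, O') = |S| - |\cup O'| + |O'|$ and finds a minimizer by trying every sub-collection. It is a brute-force illustration for very small inputs only, not the method of the thesis; the function names and the example instance are assumptions.

```python
from itertools import combinations


def objective(S, chosen):
    """f(S, O') = |S| - |union of O'| + |O'| for a chosen sub-collection of blocks."""
    covered = set().union(*chosen)
    return len(S) - len(covered) + len(chosen)


def brute_force_rgb(S, blocks):
    """Try every sub-collection of `blocks`; exponential time, tiny inputs only."""
    best_cost, best_choice = objective(S, []), []
    for k in range(1, len(blocks) + 1):
        for chosen in combinations(blocks, k):
            cost = objective(S, chosen)
            if cost < best_cost:
                best_cost, best_choice = cost, list(chosen)
    return best_cost, best_choice


# Hypothetical instance: reference genes a..g and four candidate blocks.
S = set("abcdefg")
blocks = [{"a", "b"}, {"d", "e", "f"}, {"b"}, {"f", "g"}]
cost, choice = brute_force_rgb(S, blocks)
```

In this instance, adding the block {f, g} to the returned choice leaves the objective unchanged, because it covers exactly one new gene at the price of one extra block. The same trade-off underlies the reduction to Minimum-Set-Cover.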
4.2 The RGB problem is Hard
First, we prove the NP-hardness of the RGB problem by a reduction from the MSC problem.
Our reduction relies on the essential property that the solution of the RGB problem should be a
minimum set covering itself. Using this property, we introduce three lemmas to show the reduction.
Second, we prove that our problem is equivalent in cost to the MSC problem and describe a greedy
algorithm to provide an O(ln n) approximation of the RGB problem. Finally, we describe an ILP
formulation of the RGB problem.
For the reduction, we take an instance of the MSC problem and reduce it to an RGB instance that has a solution if and only if the MSC instance has a solution. Given an instance $\langle S, C \rangle$ of MSC, where $S$ is the set of ground elements and $C$ is a collection of subsets of $S$, we construct the same instance $\langle S, C \rangle$ for the RGB problem, where $S$ is the set of genes and $C$ is the collection of subsets of the reference gene set. Observe that $\cup C = S$.
Lemma 1. If the set $C'$ is a solution to our RGB instance $\langle S, C \rangle$, then $C'$ is the minimum set cover of the set $\cup C'$.

Proof. Assume the contrary: $\exists C^* \subseteq C$ with $\cup C^* = \cup C'$ and $|C^*| < |C'|$. We then have the equality $f(S, C^*) = |S| - |\cup C^*| + |C^*| = |S| - |\cup C'| + |C^*|$. Since $|C^*| < |C'|$, $f(S, C^*) < |S| - |\cup C'| + |C'| = f(S, C')$. Hence, $f(S, C^*) < f(S, C')$. This contradicts the assumption that $C'$ is a solution to our RGB instance. Therefore, $C'$ is the minimum set cover of the set $\cup C'$.
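Lemma 2. If the set $C'$ is a solution to our RGB instance $\langle S, C \rangle$, then $\forall c \in (C \setminus C')$, $|c \setminus \cup C'| \leq 1$.

Proof. Assume the contrary: $\exists c^* \in (C \setminus C')$ with $|c^* \setminus \cup C'| \geq 2$. Consider $C^* = C' \cup \{c^*\}$. We have:

$$
\begin{aligned}
f(S, C^*) &= |S| - |\cup C^*| + |C^*| \\
&= |S| - |\cup (C' \cup \{c^*\})| + |C' \cup \{c^*\}| \\
&= |S| - |\cup C'| - |c^* \setminus \cup C'| + |C'| + 1 \\
&\leq |S| - |\cup C'| - 2 + |C'| + 1 && \text{since } |c^* \setminus \cup C'| \geq 2 \\
&= |S| - |\cup C'| + |C'| - 1 \\
&= f(S, C') - 1
\end{aligned}
$$

Hence, $f(S, C^*) < f(S, C')$. This contradicts the assumption that $C'$ is a solution to our RGB instance. Therefore, $\forall c \in (C \setminus C')$, $|c \setminus \cup C'| \leq 1$.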
Lemma 3. If the set $C'$ is a solution to our RGB instance $\langle S, C \rangle$, then for every $c \in (C \setminus C')$ with $|c \setminus \cup C'| = 1$, $C_1 := C' \cup \{c\}$ is the minimum set cover of $\cup (C' \cup \{c\})$.

Proof. The proof is straightforward. When we add a subset that contributes exactly one new element to our cover so far, we increase the number of blocks by 1 and decrease the number of elements left to cover by 1. Therefore, $f(S, C_1) = f(S, C')$ is still the minimum, so $C_1$ is also a solution to the RGB instance, and by Lemma 1 it is the minimum set cover of $\cup C_1$.
Algorithm 1: Constructing a solution of MSC given a solution of RGB
Input: $C$: a collection of subsets; $C'$: a solution of RGB
Output: $C_1$: a solution of MSC
$C \leftarrow C \setminus C'$
$C_1 \leftarrow C'$
for $c \in C$ do
    if $|c \setminus \cup C_1| = 1$ then
        $C_1 \leftarrow C_1 \cup \{c\}$
return $C_1$
Claim 1. If the set $C'$ is a solution to our RGB instance $\langle S, C \rangle$, then we can construct the set $C_1$ that is a solution to the MSC instance $\langle S, C \rangle$ using Algorithm 1 in polynomial time.

Proof. Since $C'$ is a solution to the RGB instance $\langle S, C \rangle$, according to Lemma 1, $C'$ is a minimum set cover of the set $\cup C'$. By Lemma 2, any subset $c$ in the collection $C$, but not in $C'$, can contribute at most 1 more element to $\cup C'$. Hence, if we include all of the subsets $c$ that contribute one extra element to our set $C_1$, as in the for loop of Algorithm 1, we eventually cover all of the elements in the set $S$. Now, using Lemma 3, our new set $C_1$, after including each such subset $c$, maintains the invariant that $C_1$ is the minimum set cover of $\cup C_1$. Therefore, $C_1$ is a minimum set cover of $S$, and $C_1$ is a solution for the MSC instance $\langle S, C \rangle$. We observe that Algorithm 1 runs in polynomial time, as the for loop terminates in at most $|C|$ iterations. Therefore, given that $C'$ is a solution to our RGB instance $\langle S, C \rangle$, we can construct $C_1$ that is a solution to the MSC instance $\langle S, C \rangle$ using Algorithm 1 in polynomial time. In addition, the solution to RGB has the same cost as the solution to MSC.
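The construction used in Claim 1 can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, assuming blocks are Python sets and that the condition in the loop compares the size of the newly covered part of `c` with 1; the name `extend_to_cover` and the example instance are hypothetical.

```python
def extend_to_cover(collection, rgb_solution):
    """Sketch of Algorithm 1: add every remaining subset that covers exactly one new element."""
    cover = list(rgb_solution)
    covered = set().union(*cover)
    for c in collection:
        if c in cover:
            continue  # only subsets outside the current solution are candidates
        if len(c - covered) == 1:
            cover.append(c)
            covered |= c
    return cover


def objective(S, chosen):
    """f(S, C') = |S| - |union of C'| + |C'|."""
    return len(S) - len(set().union(*chosen)) + len(chosen)


# Hypothetical MSC-style instance in which the union of the collection equals S.
S = set("abdefg")
collection = [{"a", "b"}, {"d", "e", "f"}, {"b"}, {"f", "g"}]
rgb_solution = [{"a", "b"}, {"d", "e", "f"}]  # an optimal RGB choice for this instance
cover = extend_to_cover(collection, rgb_solution)
```

In this example the extended collection covers all of `S` while the objective value stays the same, which is the property the proof relies on.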
Claim 2. If the set $C'$ is a solution to our MSC instance $\langle S, C \rangle$, then $C'$ is also a solution to the RGB instance $\langle S, C \rangle$.

Proof. Since $C'$ is a solution to our MSC instance, we have $f(S, C') = |S| - |\cup C'| + |C'| = 0 + |C'| = |C'|$.

Assume the contrary: $\exists C^* \subseteq C$ such that $C^*$ is a solution to the RGB instance and $f(S, C^*) < f(S, C')$. Now, we can use Algorithm 1 to construct $C_1$ that is a set cover of $S$ with $f(S, C_1) = f(S, C^*)$ (by Lemmas 1, 2, and 3). This provides the following property: $f(S, C_1) = |S| - |\cup C_1| + |C_1| = |S| - |S| + |C_1| = |C_1| = f(S, C^*)$. Therefore, together with $f(S, C') = |C'|$ and $f(S, C^*) < f(S, C')$, we have $f(S, C^*) < |C'|$. Since we have just proven that $f(S, C^*) = |C_1|$, this means that $|C_1| < |C'|$ and $C_1$ is a set cover of $S$. Hence, we can construct a set cover $C_1$ that is smaller than $C'$. This contradicts our assumption that $C'$ is a solution to our MSC instance (i.e., $C'$ is the minimum set cover of $S$). By contradiction, $C'$ is also a solution to the RGB instance $\langle S, C \rangle$. Furthermore, the solution to MSC has the same cost as the solution to RGB.
Both the NP-hardness and the APX-hardness of the RGB problem follow from the previous claims.
Theorem 1. The RGB problem is NP-hard.

Proof. From Claim 1 and Claim 2, we conclude that the MSC problem reduces in polynomial time to the RGB problem. Since the MSC problem is NP-hard, it follows that the RGB problem is NP-hard.

Theorem 2. The RGB problem is APX-hard and hard to approximate within $0.72 \ln n$.

Proof. We observe that a solution of MSC and a solution of RGB are of equal cost. Therefore, our reduction is also an approximation-preserving reduction. Hence, RGB is hard to approximate within $0.72 \ln n$.
4.3 Solving the RGB problem
With the NP-hardness result in hand, we do not expect an efficient algorithm that solves the problem exactly. We also showed that the problem is APX-hard; hence, RGB is unlikely to have an efficient constant-factor approximation algorithm. However, our reduction shows that the solution of the MSC problem has the same cost as the solution of RGB. Therefore, we devise a greedy algorithm for the RGB problem as follows.
4.3.1 Greedy Algorithm
Algorithm 2: Greedy algorithm to solve the RGB problem
Input: $S$: gene set of the reference gene block; $C$: set of orthologous gene blocks
Output: $C'$: solution of RGB
1 $S' \leftarrow \cup C$
2 $C' \leftarrow \{\}$
3 while $S' \neq \emptyset$ do
4     Select $c \in C$ that maximizes $|c \cap S'|$
5     $S' \leftarrow S' \setminus c$
6     $C' \leftarrow C' \cup \{c\}$
7 return $C'$
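A direct Python transcription of the greedy procedure is sketched below, assuming that gene blocks are given as sets of gene names. The function name `greedy_rgb` and the example instance are illustrative and are not taken from the thesis code.

```python
def greedy_rgb(blocks):
    """Sketch of Algorithm 2: repeatedly pick the block covering the most uncovered genes."""
    uncovered = set().union(*blocks)   # S' <- union of C
    chosen = []                        # C' <- {}
    while uncovered:
        best = max(blocks, key=lambda c: len(c & uncovered))
        chosen.append(best)            # C' <- C' union {c}
        uncovered -= best              # S' <- S' \ c
    return chosen


# Hypothetical instance: reference genes a..g and four candidate blocks.
S = set("abcdefg")
blocks = [frozenset("ab"), frozenset("def"), frozenset("b"), frozenset("fg")]
chosen = greedy_rgb(blocks)
covered = set().union(*chosen)
cost = len(S) - len(covered) + len(chosen)  # f(S, C')
```

The loop terminates because every uncovered gene belongs to some block, so each iteration covers at least one new gene. Ties are broken by input order in this sketch, which is a design choice of the example rather than something the algorithm prescribes.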
We will show that our algorithm is a ln n approximation of our RGB problem. In order to do so, we need the following lemma.
Lemma 4. Let $C^{opt}$ be the optimal result of our RGB instance $\langle S, C \rangle$, and $C^*$ be the optimal result of the MSC instance $\langle S', C \rangle$ as in Algorithm 2. We will show that:

$$f(S, C^{opt}) = |S \setminus \cup C| + f(S', C^*)$$

Proof. Following Lemma 1, $C^{opt}$ is the minimum set cover of $\cup C^{opt}$. Then, using Algorithm 1, we can build a set $C_1$ which is the minimum set cover of $S' = \cup C$ so that $f(S, C^{opt}) = f(S, C_1)$. Hence, $C_1$ is also an optimal result of the MSC instance $\langle S', C \rangle$, which means $|C_1| = |C^*|$. We start expanding the equality $f(S, C^{opt}) = f(S, C_1)$ as follows:

$$
\begin{aligned}
f(S, C^{opt}) &= f(S, C_1) \\
&= |S| - |\cup C_1| + |C_1| \\
&= |S| - |\cup C| + |C_1| && C_1 \text{ is a set cover of } \cup C \\
&= |S \setminus \cup C| + |C_1| \\
&= |S \setminus \cup C| + |S'| - |\cup C^*| + |C_1| && C^* \text{ covers } S' \\
&= |S \setminus \cup C| + |S'| - |\cup C^*| + |C^*| && |C_1| = |C^*| \text{ as shown above} \\
&= |S \setminus \cup C| + f(S', C^*) && f(S', C_1) = f(S', C^*)
\end{aligned}
$$

Therefore, the proof is concluded.
Theorem 3. The greedy algorithm runs in polynomial time and provides an O(ln n) - approximation of our RGB problem, where n is the cardinality of the set S.
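Proof. First, we use the result of (Chvatal, 1979). Given that $C^*$ is the optimal solution to the MSC instance $\langle S', C \rangle$ and $C'$ is the output of Algorithm 2, we have:

$$
\begin{aligned}
& f(S', C') \leq (1 + \ln n)\, f(S', C^*) \\
\Leftrightarrow\ & f(S', C') \leq (1 + \ln n)\,(f(S, C^{opt}) - |S \setminus \cup C|) && \text{from Lemma 4} \\
\Leftrightarrow\ & f(S', C') \leq (1 + \ln n)\,(f(S, C^{opt}) - \delta) && \text{treating } \delta := |S \setminus \cup C| \text{ as a constant} \\
\Rightarrow\ & f(S', C') \leq (1 + \ln n)\, f(S, C^{opt}) && \delta \geq 0
\end{aligned}
$$

Since $C'$ covers $S' = \cup C$, we also have $f(S, C') = \delta + f(S', C')$, and therefore $f(S, C') \leq \delta + (1 + \ln n)(f(S, C^{opt}) - \delta) \leq (1 + \ln n)\, f(S, C^{opt})$.

Let $|S| = n$ and $|C| = m$. Lines 1 and 2 of the algorithm take time linear in the size of the input. In each iteration of the while loop, we go through the set of orthologous gene blocks to find the best gene block, which takes $O(mn)$ time with a direct scan, and the loop runs at most $\min(m, n)$ times because every iteration covers at least one new gene. Hence, the algorithm runs in polynomial time. The claimed statement follows.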
4.3.2 Integer Linear Programming
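Given an instance $\langle S, C \rangle$ of the RGB problem, we define a variable $x_i$ for every set $c_i \in C$ that takes the value 1 if we include $c_i$ in our answer and 0 otherwise. We can express our RGB problem as the following integer linear program:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{|C|} x_i && (4.1) \\
\text{subject to} \quad & \sum_{i \,:\, g \in c_i} x_i \geq 1 \quad \forall g \in \cup C && (4.2) \\
& x_i \in \{0, 1\} \quad \forall c_i \in C && (4.3)
\end{aligned}
$$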
The size of this ILP is the same as the size of the ILP for the typical MSC problem, which is at most $|C| + |S| \cdot |C|$.

Theorem 4. Let $C^{opt}$ be the optimal result of our RGB instance, and $C^*$ be the optimal result of the ILP formulation above. Then:

$$f(S, C^{opt}) = |S \setminus \cup C| + f(S', C^*)$$

Proof. Since $C^*$ is an optimal solution for the ILP, it is also an optimal solution to the instance $\langle S', C \rangle$ of MSC. Using Lemma 4, we can conclude the proof.
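The program (4.1)-(4.3) and the identity in Theorem 4 can be checked on a toy instance. The sketch below does not use an ILP solver; as a stand-in, it enumerates every 0/1 vector and keeps a feasible one with the smallest sum, which is only practical for a handful of blocks. The function name and the instance are assumptions made for illustration.

```python
from itertools import product


def solve_cover_exactly(blocks):
    """Minimise sum(x) subject to constraint (4.2) by trying every 0/1 vector x."""
    universe = set().union(*blocks)
    best = None
    for x in product((0, 1), repeat=len(blocks)):
        # Constraint (4.2): every gene in the union must lie in a selected block.
        feasible = all(any(x[i] and g in blocks[i] for i in range(len(blocks)))
                       for g in universe)
        if feasible and (best is None or sum(x) < sum(best)):
            best = x
    return best


# Hypothetical instance: gene c of the reference is not covered by any block.
S = set("abcdefg")
blocks = [frozenset("ab"), frozenset("def"), frozenset("b"), frozenset("fg")]
x = solve_cover_exactly(blocks)
uncoverable = len(S - set().union(*blocks))   # |S \ union of C|
rgb_optimum = uncoverable + sum(x)            # right-hand side of Theorem 4
```

For this instance the value obtained from the covering program, plus the number of reference genes that no block contains, matches the optimum of the simplified RGB objective found by exhaustive search over sub-collections.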
4.4 Results and Discussion
Here we evaluate the performance of the Greedy method in approximating the RGB problem. The performance is analyzed by comparing our algorithm with the previously best-known approach, the Heuristic method (Ream et al., 2015), and our exact ILP solution as described in Section 4.3.2. All of our studies are performed using the previously used benchmark data set (Ream et al., 2015), which we describe first. In our comparative studies, we first analyze the run-times (see Section 4.4.1), then the accuracy (see Section 4.4.2), and finally conclude the studies (see Section 4.4.3). We ran all our experiments on a Dell laptop with an Intel Core i7 8th Gen processor and 15 GB of DDR4 RAM, running Ubuntu 18.04.4 LTS. The code was written in Python 3.5.

We used the experimentally identified operons from the E. coli K-12 genome as the gold standard. We chose E. coli because of the high quality of its annotation and its large number of experimentally verified operons. The genomes in which we looked for orthoblocks were taken from (Fani et al., 2005). The Heuristic method was previously used on this group of taxa (Ream et al., 2015), and the 55 operons were chosen based on the criterion that each operon comprises at least 5 genes.
4.4.1 Scalability Study
We compared the run-time of the Greedy method, the Heuristic method, and the ILP approach, using data sets ranging from input sizes of 5 to 13 genes per operon.
4.4.1.1 Experimental Settings.
We compare the running time of the three methods. In each analysis, we ran the three methods on the same data set (55 gene blocks) and recorded their running times in log10(seconds).
4.4.1.2 Results and Discussion.
[Figure 4.1: plot of running time (log10 scale) against operon name; the legend lists the series Naive, Approximated, and ILP, and the operons are grouped by the number of genes in the operon.]

Figure 4.1: The x-axis shows the operon (gene block) names, sorted in ascending order of the number of genes in each block. From left to right: operons ast to ygb have 5 genes, operons cai to ydh have 6 genes, operons cas to tdc have 7 genes, operons bam to ydg have 8 genes, operons atp to yje have 9 genes, operons paa to yrb have 11 genes, operon hyf has 12 genes, and operon nuo has 13 genes. The y-axis shows the running time of each method in seconds on a log10 scale.
Overall, as shown in Figure 4.1, ILP takes the longest time to finish for all operons except cai, mdt, rbs, and paa. The Greedy method is the fastest without exception, beating the Heuristic method by a large margin. On average, the running time is $2.9 \times 10^{-7}$ seconds for the Greedy method, $8.8 \times 10^{-5}$ seconds for the Heuristic method, and $10^{-4}$ seconds for the ILP method. Based on these averages, the Greedy method is between two and three orders of magnitude faster than both the Heuristic method and the ILP method. This is expected because the Heuristic method tries to generate almost all possible combinations of gene blocks, and the ILP method solves the problem exactly. To our surprise, the Heuristic method is the slowest in the 4 cases of cai, mdt, rbs, and paa. The reason is not that those operons have more genes, but that they have more potential orthologous gene blocks. Although the Heuristic method only takes milliseconds to finish in the worst case, we only surveyed 55 operons within a group of 33 taxa. Currently, NCBI has more than 26,000 bacterial genomes. The running time will become important when running on more complex operons and a larger number of species. In sum, the Greedy method performs the best in terms of run time.
4.4.2 Accuracy Study
We evaluate the methods in terms of the event-based cost function. That is, we say that the method that can reconcile the blocks with the reference operon using the smallest number of split and deletion events is more accurate. The reasoning is that a lower cost is the most parsimonious explanation for the evolutionary distance between any two orthoblocks as explained in (Nguyen et al., 2019). This is somewhat analogous to the process of pairwise protein sequence alignment by assigning a cost function based on the costs of indels and substitutions.
4.4.2.1 Experimental Settings
We generated the best orthologous gene block in each of the 33 species using the three different methods. We then calculated the total number of deletion events and split events for each operon for each method and present them in Figure 4.2a and Figure 4.2b. Then, for each method, we calculated the sum of deletion events and split events for each operon (Figure 4.2c), which represents our cost function. A method is considered better if its number of events is lower.
4.4.2.2 Results and Discussion
As shown in Figure 4.2a, in most cases, the three methods have the same number of deletion events. In 18 out of 55 cases, the Greedy method has a smaller number of deletion events than the Heuristic method, and the two have the same number of deletion events in the rest of the cases. In all cases, ILP has the lowest number of deletion events. As shown in Figure 4.2b, the three methods have the same number of split events in most cases. However, in 13 of 55 operons, the Heuristic method has a lower number of split events than the Greedy method. In 3 of 55 operons, the Greedy method has a lower number of splits than the Heuristic method. Figure 4.2c conveys the cost function (sum of deletion and split events) for each of the methods. In 15 of 55 operons, the Greedy method has a lower cost than the Heuristic method. The Greedy and Heuristic methods perform similarly on the other operons. In all cases, the ILP method has the lowest cost, as it is an exact method by design. The Greedy method follows: it is more accurate than the Heuristic method in 15 of 55 operons.
[Figure 4.2, panels (a), (b), and (c): number of deletion events, number of split events, and split events + deletion events, respectively, plotted against operon name. Legend: Naive, Approximated, ILP, and the number of genes in the operon.]
Figure 4.2: The x-axis is the same as in Figure 4.1. The y-axis is the event count for the orthologous gene block in each species compared to the reference gene block. In Figures 4.2a, 4.2b, and 4.2c, a dot represents the deletion events, the split events, and the sum of deletion and split events, respectively. Red dots represent the Heuristic method, blue dots the Greedy method, and green dots the ILP method.
4.4.3 Concluding Discussion
From our scalability and accuracy study, we observe that the Greedy method is the fastest, being 100× faster than the Heuristic method, and 1000× faster than the exact ILP method. In terms of accuracy, the Greedy method follows the ILP method very closely and either outperforms the Heuristic method or has the same results for all of the 55 operons. Therefore, on the small dataset that was used in (Ream et al., 2015), our Greedy method outperforms the Heuristic method in both scalability and accuracy.
4.5 Conclusion
Finding orthologous gene blocks is an essential step in understanding the evolution of gene blocks and complexity in bacterial genomes. In this study we formally define the problem of identifying orthologous gene blocks given a reference operon, prove that it is NP-hard, and present an algorithm that guarantees an O(ln n) approximation and runs in polynomial time. In addition, we designed an ILP formulation and proved that it solves our problem exactly. In our experimental study, we demonstrated that the Greedy method performs better than the Heuristic method in terms of both accuracy and scalability. We note that the methods developed in this work cannot handle duplication events. While those are relatively rare, handling duplications is discussed in Chapter 7.
CHAPTER 5. TRACING THE ANCESTRY OF OPERONS IN BACTERIA
Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Ancestral reconstruction is essential for studying the evolutionary history of the biological characteristics of complex systems. The reconstruction can provide a peek into how complex systems evolve and why the systems develop in a specific way. In this chapter, we focus on the problem of reconstructing ancestral states of gene blocks in bacteria. Reconstructing plausible ancestral states of extant gene blocks and operons can help us understand how gene blocks evolve, identify possible functional intermediate states, and determine which forces might affect their evolution.
We use the definitions in Chapter 3 to define the Ancestral-Reconstruction-Gene-Block (ARGB) problems. Given a reference gene block and a phylogenetic tree in which each leaf node is labeled with its homologous gene block, the problem seeks a gene block labeling for each of the inner nodes. We introduce two versions of the problem and present their respective algorithms. In the results section, we conduct several studies on the operons of E. coli and B. subtilis and form hypotheses on the evolution of these operons.
5.1 The Ancestral-Reconstruction-Gene-Block problem
Our problem is to find the labeling that minimizes the overall cost for the three evolutionary events (deletion, duplication, and split). We now present the local ARGB problem and the global ARGB problem.
Problem 5. Ancestral-Reconstruction-Gene-Block (local)
Instance: < T, G, λ >, which are the phylogenetic tree, the set of genes in the reference gene block, and the labeling of L(T).
Find: < λ̂ > so that for every node u and its immediate children u1, u2, c(u, u1) + c(u, u2) is minimum.
Problem 6. Ancestral-Reconstruction-Gene-Block (global)
Instance: < T, G, λ >, which are the phylogenetic tree, the set of genes in the reference gene block, and the labeling of L(T).
Find: < λ̂ > so that c(λ̂) is minimum.
5.2 Solving the ARGB problems
Here, we explore two related maximum parsimony heuristic approaches, local and global, to reconstruct the ancestral gene blocks. In each approach, we will present an algorithm, and an optimality analysis.
5.2.1 Local Algorithm
Briefly, the local approach focuses on finding the optimal parent ancestral gene block given its children gene blocks. For each internal node u, let u1 and u2 be its two children. The intuition is that we have to include a gene in the parent if both of the children have it. However, greedily propagating an included gene up a tree may predict its ancestral existence in deeper internal nodes than is warranted. To counter this problem, at each tree vertex v, for each gene g, we introduce a correction by checking the fraction of the leaf nodes under v that contain g. Since gene loss tends to happen more often than gene gain, we use a threshold of 0.5 to indicate whether v contains gene g or not. To that end, we developed a greedy local optimization algorithm (Algorithm 3).
5.2.1.1 Algorithm
Algorithm 3: Local approach
Input: T, G, λ
Result: λ̂
for each internal node u, traversing T in post-order do
    Let u1, u2 be the children of u
    Let O1, O2 be the orthoblocks assigned to u1, u2
    Let initial := O1 ∪ O2
    Remove genes g in initial that have FREQg(u) ≤ .5
    Remove duplicated genes g in initial that have DUPg(u) ≤ .5
    λ̂(u) := initial
Return λ̂
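The gene-content part of the local approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: orthoblocks are reduced to plain gene sets (splits and duplications are ignored), FREQg(u) is taken to be the fraction of leaves under u whose block contains g, and the `Node` class, the function names, and the toy gene content are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Binary tree node (hypothetical structure); leaves carry observed gene sets."""
    name: str
    genes: frozenset = frozenset()
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaf_gene_sets(u: Node) -> list:
    """Gene sets of all leaves in the subtree rooted at u."""
    if u.is_leaf():
        return [u.genes]
    return leaf_gene_sets(u.left) + leaf_gene_sets(u.right)


def freq(g: str, u: Node) -> float:
    """Assumed reading of FREQ_g(u): fraction of leaves under u that contain g."""
    leaves = leaf_gene_sets(u)
    return sum(g in genes for genes in leaves) / len(leaves)


def local_reconstruct(u: Node, threshold: float = 0.5) -> frozenset:
    """Post-order labeling: union of the children's genes, filtered by frequency."""
    if u.is_leaf():
        return u.genes
    o1 = local_reconstruct(u.left, threshold)
    o2 = local_reconstruct(u.right, threshold)
    initial = o1 | o2
    # Drop genes with FREQ_g(u) <= threshold, as in the pseudocode above.
    u.genes = frozenset(g for g in initial if freq(g, u) > threshold)
    return u.genes


# Toy tree ((A, B), C) with made-up gene content.
a = Node("A", frozenset({"atpA", "atpB"}))
b = Node("B", frozenset({"atpA"}))
c = Node("C", frozenset({"atpA", "atpC"}))
ab = Node("AB", left=a, right=b)
root = Node("root", left=ab, right=c)
local_reconstruct(root)
```

In this toy tree, atpB occurs in exactly half of the leaves under AB and atpC in one third of the leaves under the root, so both are dropped and only atpA is retained at the inner nodes. Recomputing the leaf sets at every node keeps the sketch short; a real implementation would more likely cache per-subtree counts.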
5.2.1.2 Optimality
Let λ̂ be the result of our Algorithm 3 with input (T, G, λ). For each u ∈ I(T), let u1, u2 be its children and O := λ̂(u), O1 := λ̂(u1), O2 := λ̂(u2). We will show that cd(O, O1) + cd(O, O2) and cu(O, O1) + cu(O, O2) are locally minimal.
Lemma 5. ∀g ∈ G, if FREQg(u) > .5 then either FREQg(u1) > .5 or FREQg(u2) > .5
Proof. We assume the contrary, which means that FREQg(u1) ≤ .5 and FREQg(u2) ≤ .5. Then
\[
|\{v \in L(u_1) \mid g \in \mathrm{Gene}(\lambda(v))\}| \le \frac{|L(u_1)|}{2}
\quad\text{and}\quad
|\{v \in L(u_2) \mid g \in \mathrm{Gene}(\lambda(v))\}| \le \frac{|L(u_2)|}{2}.
\]
Define H := {v ∈ L(u1) ∪ L(u2) | g ∈ Gene(λ(v))}. From the two inequalities above, we have:
\[
|H| \le \frac{|L(u_1)|}{2} + \frac{|L(u_2)|}{2}.
\]
Since u1, u2 are the children of u, L(u1) ∪ L(u2) = L(u) and L(u1) ∩ L(u2) = ∅. Therefore
\[
|\{v \in L(u) \mid g \in \mathrm{Gene}(\lambda(v))\}| \le \frac{|L(u)|}{2},
\]
that is, FREQg(u) ≤ .5, which contradicts the hypothesis FREQg(u) > .5. Hence, either FREQg(u1) > .5 or FREQg(u2) > .5.
Lemma 6. ∀g ∈ G, if FREQg(u) ≤ .5 then either FREQg(u1) ≤ .5 or FREQg(u2) ≤ .5
Proof. The proof is analogous to that of Lemma 5, with the inequalities reversed.
Lemma 7. ∀g ∈ G and ∀O′ ∈ Ω, if g ∈ Gene(O) and g ∉ Gene(O′), then |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| ≤ |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|.
Proof. Since g ∈ Gene(O), FREQg(u) > .5. Therefore, FREQg(u1) > .5 or FREQg(u2) > .5 (by Lemma 5). Hence, g ∈ Gene(u1) or g ∈ Gene(u2). Consider 3 cases:
1. If u1 and u2 contain g, then
|Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |1 − 1| + |1 − 1| = 0
|Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)| = |0 − 1| + |0 − 1| = 2
Therefore, |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| < |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|
2. If only u1 contains g, then
|Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |1 − 1| + |1 − 0| = 1
|Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)| = |0 − 1| + |0 − 0| = 1
Therefore, |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|
3. If only u2 contains g, then
|Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |1 − 0| + |1 − 1| = 1
|Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)| = |0 − 0| + |0 − 1| = 1
Therefore, |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|
From the above cases, we conclude that |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| ≤ |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|.
Lemma 8. ∀g ∈ G and ∀O′ ∈ Ω, if g ∉ Gene(O) and g ∈ Gene(O′), then |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| ≤ |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|.
Proof. Since g ∉ Gene(O), FREQg(u) ≤ .5. Therefore, FREQg(u1) ≤ .5 or FREQg(u2) ≤ .5 (by Lemma 6). Hence, g ∉ Gene(u1) or g ∉ Gene(u2). Consider 3 cases:
1. If u1 and u2 do not contain g, then
|Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |0 − 0| + |0 − 0| = 0
|Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)| = |1 − 0| + |1 − 0| = 2
Therefore, |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| < |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|
2. If only u1 does not contain g, then
|Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |0 − 0| + |0 − 1| = 1
|Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)| = |1 − 0| + |1 − 1| = 1
Therefore, |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|
3. If only u2 does not contain g, then
|Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |0 − 1| + |0 − 0| = 1
|Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)| = |1 − 1| + |1 − 0| = 1
Therefore, |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| = |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|
From the above cases, we conclude that |Ig(O) − Ig(O1)| + |Ig(O) − Ig(O2)| ≤ |Ig(O′) − Ig(O1)| + |Ig(O′) − Ig(O2)|.
We can now state our theorems regarding the deletion costs and duplication costs.
Theorem 5. Let λ̂ be the result of our Algorithm 3 with input (T, G, λ). Let I(T) be the set of inner nodes of tree T, and for any given node u, let T(u) be the subtree rooted at u and L(u) the set of leaf nodes of that subtree. For each u ∈ I(T), let u1, u2 be its children. We claim that cd(O, O1) + cd(O, O2) is minimum.
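Proof. Given O′ ∈ Ω, we will show that cd(O′, O1) + cd(O′, O2) ≥ cd(O, O1) + cd(O, O2). We have:
\[
\begin{aligned}
c_d(O', O_1) + c_d(O', O_2)
&= \sum_{g} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr) \\
&= \sum_{g \in O'} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr)
 + \sum_{g \notin O'} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr) \\
&= \sum_{g \in O',\, g \in O} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr)
 + \sum_{g \in O',\, g \notin O} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr) \\
&\quad + \sum_{g \notin O',\, g \in O} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr)
 + \sum_{g \notin O',\, g \notin O} \bigl(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\bigr)
\end{aligned}
\tag{5.1}
\]
On the other hand, we have:
\[
\begin{aligned}
c_d(O, O_1) + c_d(O, O_2)
&= \sum_{g} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr) \\
&= \sum_{g \in O} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr)
 + \sum_{g \notin O} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr) \\
&= \sum_{g \in O',\, g \in O} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr)
 + \sum_{g \in O',\, g \notin O} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr) \\
&\quad + \sum_{g \notin O',\, g \in O} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr)
 + \sum_{g \notin O',\, g \notin O} \bigl(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\bigr)
\end{aligned}
\tag{5.2}
\]
The sums over g ∈ O′, g ∈ O and over g ∉ O′, g ∉ O are identical in (5.1) and (5.2), because Ig(O) = Ig(O′) for those genes. By Lemma 7, each term of the sum over g ∉ O′, g ∈ O in (5.2) is at most the corresponding term in (5.1), and by Lemma 8 the same holds for the sum over g ∈ O′, g ∉ O. Hence, from Lemma 7 and Lemma 8, we have:
\[
c_d(O, O_1) + c_d(O, O_2) \le c_d(O', O_1) + c_d(O', O_2)
\]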
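The inequality above can be checked by brute force on a small example. The sketch below is illustrative only and rests on assumptions: blocks are treated as plain gene sets, so that the deletion cost, written in the proof as the sum over g of |Ig(A) − Ig(B)|, equals the size of the symmetric difference; the function names and the toy gene labels are hypothetical.

```python
from itertools import chain, combinations


def deletion_cost(block_a, block_b):
    """Sum over genes of |I_g(A) - I_g(B)|, i.e. the size of the symmetric difference."""
    return len(set(block_a) ^ set(block_b))


def all_subsets(genes):
    """Every candidate parent gene set O' drawn from the reference genes."""
    genes = sorted(genes)
    return (set(c) for c in chain.from_iterable(
        combinations(genes, r) for r in range(len(genes) + 1)))


def min_local_deletion_cost(reference_genes, o1, o2):
    """Brute-force minimum of c_d(O', O1) + c_d(O', O2) over all O'."""
    return min(deletion_cost(c, o1) + deletion_cost(c, o2)
               for c in all_subsets(reference_genes))


# Toy children blocks over a four-gene reference (made-up labels).
G = {"a", "b", "c", "d"}
O1, O2 = {"a", "b"}, {"a", "c"}
best = min_local_deletion_cost(G, O1, O2)
parent = {"a"}  # gene shared by both children kept, gene in neither child left out
parent_cost = deletion_cost(parent, O1) + deletion_cost(parent, O2)
```

Here `best` and `parent_cost` are both 2: a gene present in exactly one child costs one event whether or not the parent includes it (the equality cases of Lemmas 7 and 8), while dropping the shared gene `a` or adding the absent gene `d` would each add two events.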
Theorem 6. Let λ̂ be the result of our Algorithm 3 with input (T, G, λ). Let I(T) be the set of inner nodes of tree T, and for any given node u, let T(u) be the subtree rooted at u and L(u) the set of leaf nodes of that subtree. For each u ∈ I(T), let u1, u2 be its children. We claim that cu(O, O1) + cu(O, O2) is minimum.
Proof. The proof follows the same logic as for deletion costs, except that this time we consider DUPg(u) instead of FREQg(u).
5.2.2 Global Algorithm
Here we try to achieve minimal deletion and duplication costs globally. Intuitively, for each node v and each gene g in the reference operon, we decide whether gene g could appear in the orthoblock that we will assign to v. To do this, we use dynamic programming. By traversing the phylogenetic tree bottom-up and then top-down, we determine the occurrence of each gene in the reference operon for each node v. We also determine whether a gene should be duplicated in the same manner. For split costs, we generate the relevant gene blocks of the two children given the set of genes to be included.
5.2.2.1 Algorithm
Algorithm 4: Global approach
Input: T, G, λ
Result: λ̂
for each gene g in G do
    Determine whether to include gene g in the gene set of each internal node u using the Fitch algorithm
    Determine whether to include gene g in the duplicated gene set of each internal node u using the Fitch algorithm
for each internal node u, traversing T in post-order do
    Let u1, u2 be the children of u
    Let O1 := λ(u1), O2 := λ(u2)
    if |Rel(O1, Gene(u))| < |Rel(O2, Gene(u))| then
        initial := Rel(O1, Gene(u))
    else
        initial := Rel(O2, Gene(u))
    λ̂(u) := initial
Return λ̂
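The first loop of Algorithm 4 can be illustrated with a two-pass Fitch procedure on a binary presence/absence character. The following Python sketch is an assumption-laden illustration rather than the implementation used in this work: it covers only gene inclusion (not duplications, Rel, or splits), node names are assumed to be unique, ties at the root are broken in favor of presence (`tie_break=1`, an arbitrary choice), and the `Node` class, function names, and toy leaves are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Binary tree node (hypothetical structure); leaves carry observed gene sets."""
    name: str
    genes: frozenset = frozenset()
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def fitch_up(u, g, states):
    """Bottom-up pass: candidate states {0}, {1} or {0, 1} for gene g at each node."""
    if u.is_leaf():
        states[u.name] = {int(g in u.genes)}
    else:
        s1 = fitch_up(u.left, g, states)
        s2 = fitch_up(u.right, g, states)
        # Intersection if non-empty, otherwise union (classic Fitch rule).
        states[u.name] = (s1 & s2) or (s1 | s2)
    return states[u.name]


def fitch_down(u, g, states, parent_state=None, tie_break=1):
    """Top-down pass: fix one state per inner node, preferring the parent's state."""
    cand = states[u.name]
    if parent_state in cand:
        state = parent_state
    elif len(cand) == 1:
        state = next(iter(cand))
    else:
        state = tie_break  # only reachable at the root with candidates {0, 1}
    if not u.is_leaf():
        u.genes = (u.genes | {g}) if state else (u.genes - {g})
        fitch_down(u.left, g, states, state, tie_break)
        fitch_down(u.right, g, states, state, tie_break)


def global_gene_sets(root, reference_genes):
    """Run both passes independently for every gene of the reference block."""
    for g in reference_genes:
        states = {}
        fitch_up(root, g, states)
        fitch_down(root, g, states)


def changes(u, g):
    """Number of tree edges on which the presence of g differs."""
    if u.is_leaf():
        return 0
    return sum(int((g in u.genes) != (g in c.genes)) + changes(c, g)
               for c in (u.left, u.right))


# Toy tree ((A, B), (C, D)) with made-up gene content.
leaves = [Node("A", frozenset({"x", "y"})), Node("B", frozenset({"x"})),
          Node("C", frozenset({"y"})), Node("D", frozenset({"y"}))]
ab = Node("AB", left=leaves[0], right=leaves[1])
cd = Node("CD", left=leaves[2], right=leaves[3])
root = Node("root", left=ab, right=cd)
global_gene_sets(root, ["x", "y"])
```

Because each gene is handled on its own, the per-gene change counts add up to the total gene-content cost, which is the independence argument used in the optimality analysis that follows. In the toy tree, each of the two genes needs a single change.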
5.2.2.2 Optimality
Let λ̂ be the result of our Algorithm 4 with input (T, G, λ). We will show that $\sum_{(v_1,v_2)\in E(T_u)} c_d(\hat{\lambda}(v_1), \hat{\lambda}(v_2))$ and $\sum_{(v_1,v_2)\in E(T_u)} c_u(\hat{\lambda}(v_1), \hat{\lambda}(v_2))$ are the global minimum for subtree Tu.
Lemma 9. ∀g ∈ G, ∀u ∈ I(T), $\sum_{(v_1,v_2)\in E(T_u)} |I_g(\hat{\lambda}(v_1)) - I_g(\hat{\lambda}(v_2))|$ is minimum.
Proof. Any node u ∈ I(T) can only be of 3 types: a node whose children are both leaf nodes, a node with one leaf child and one inner-node child, and a node whose children are both inner nodes. We denote them as type 1, type 2, and type 3, respectively. We can then prove the lemma by induction as we traverse the tree T in post-order. The proof is similar to the proof of the Fitch algorithm, as the value at each node can only be 0 or 1.
Theorem 7. Let λ̂ be the result of our Algorithm 4 with input (T, G, λ). Let I(T) be the set of inner nodes of tree T, and for any given node u, let T(u) be the subtree rooted at u and L(u) the set of leaf nodes of that subtree. We will show that $\sum_{(v_1,v_2)\in E(T_u)} c_d(\hat{\lambda}(v_1), \hat{\lambda}(v_2))$ is the global minimum for subtree Tu.
Proof. We can rewrite the objective function as:
\[
\sum_{(v_1,v_2)\in E(T_u)} c_d(\hat{\lambda}(v_1), \hat{\lambda}(v_2))
= \sum_{g\in G} \; \sum_{(v_1,v_2)\in E(T_u)} |I_g(\hat{\lambda}(v_1)) - I_g(\hat{\lambda}(v_2))|
\]
Because the genes g ∈ G are independent of each other, each inner sum can be minimized separately (Lemma 9), and adding the individual minima gives the minimum of the whole sum.
Since our theorem holds for every inner node, it also holds for the root. Therefore, the global approach provides optimal deletion costs. Similarly, we can prove the same for duplication costs.
5.3 Results and Discussion
We used experimentally-identified operons from E. coli K-12 and B. subtilis str. 168 genomes as gold standards for deriving operons from Gram-negative and Gram-positive bacteria, respectively.
The reason we chose these two species is that they both have well-annotated genomes, including experimentally verified and functionally annotated operons.
5.3.1 Operons using Escherichia coli as a reference taxon
We chose E. coli as the reference species for proteobacteria, a major group of Gram-negative bacteria. Our selection resulted in a set of proteobacteria species comprising three ε-proteobacteria, six α-proteobacteria, seven β-proteobacteria, and 17 γ-proteobacteria, including E. coli. These taxa include two γ-proteobacteria insect endosymbionts: Buchnera aphidicola and Candidatus Blochmannia. These two species have unusually small genomes due to their endosymbiotic nature and display massive gene loss. We reconstructed ancestors for the following operons from E. coli (described below): atpIBEFHAGDC and paaABCDEFGHIJK.
atpIBEFHAGDC. The atpIBEFHAGDC operon codes for F1Fo-ATPase, which catalyzes the synthesis of ATP from ADP and inorganic phosphate (Kasimoglu et al., 1996). ATP synthase is composed of two fractions: F1 and Fo (Senior, 1990). The F1 fraction contains the catalytic sites, and its proteins are coded by five genes (atpA, atpC, atpD, atpG, atpH) (Senior, 1990). The Fo complex constitutes the proton channel, and its proteins are coded by three genes: atpF, atpE, and atpB. atpI is a non-essential regulatory gene. Figure S3 shows the high degree of conservation of this operon.
Figure 5.1 shows ancestral reconstruction using the global maximum parsimony algorithm. Both local (Fig S1) and global reconstructions show the consistency of having orthoblocks atpACDGH and atpBF in the most common ancestors for different Gram-negative bacteria. This finding agrees with the long-standing hypothesis that the Fo and the F1 fractions have evolved separately, with the two fractions having homologs in the hexameric DNA helicases and with flagellar motor complexes. Although we find the gene atpI in several species, the reconstruction predicts that atpI is not in the same cluster with the other genes. As stated, atpI is probably not an essential component of the F1Fo ATPase (NJ, 1984). Another interesting finding is the duplication of atpF in ε-proteobacteria, which appears to predate their common ancestor. Note that all genes exist as a gene block, even in the endosymbionts Blochmannia and B. aphidicola.
[Figure 5.1 tree; clade labels: ε-proteobacteria, α-proteobacteria, β-proteobacteria, γ-proteobacteria.]
Figure 5.1: Ancestral reconstruction of operon atpIBEFHAGDC using the global optimization approach. The lower-case letters in each tree node represent the genes in the orthoblock (e.g. “a” represents “atpA”, see legend in blue bar, top). A ‘|’ designates a split (i.e. a distance ≥500bp between the genes to either side of the ‘|’). The green bar on top shows the total number of events that took place in this reconstruction. The numbers in the brackets in the inner nodes are a 3-tuple showing the cumulative count of events going from the leaf nodes to the tree root in the following order: [deletions, duplications, splits]. No orthologous gene blocks were found in species labeled with an asterisk (∗). The reference genome E. coli is marked with a box. These naming and color conventions persist through this study.
The ε-, α-, β-, and γ-proteobacteria species all have a conserved intact F1 complex (coded by the atpACDGH cluster), which predates their common ancestor. The genes included in the Fo complex in ε-proteobacteria (gene products atpB, atpE, atpF) are not in the same cluster as the genes making up F1. Furthermore, it is unclear whether the gene split that is only found in ε-proteobacteria is a split that predates the least common ancestor with the other proteobacteria clades, or whether it is a split introduced in the ε-proteobacteria. From the reconstructions provided, the scenario appears to be the latter. Conversely, this observation may also be a result of the small number of species studied here. The species in the ε- and α-proteobacteria display a known duplication of gene atpF. atpF′ appears as a sister group to atpF (Koumandou and Kossida, 2014).
paaABCDEFGHIJK. The operon paaABCDEFGHIJK codes for genes involved in the catabolism of phenylacetate (Martin and McInerney, 2009). The ability to catabolize phenylacetate varies greatly between proteobacterial species, and even among different E. coli K-12 strains. In contrast with the atpIBEFHAGDC operon, which is conserved through many species, the operon paaABCDEFGHIJK is only found in full complement as an operon in some E. coli K-12 strains and some Pseudomonas putida strains. While obviously less conserved than the atpIBEFHAGDC operon, certain orthoblocks appear to be conserved, providing possible partial functionality. The orthoblock paaABCDE is found in three Bordetella species and also in Bradyrhizobium diazoefficiens. The products of paaA, paaB, paaC, and paaE make up the subunits of the 1,2-phenylacetyl-CoA epoxidase, and paaD is hypothesized to form an iron-sulfur cluster with the product of paaE (Grishin et al., 2011). We did not find orthologs in the endosymbionts B. aphidicola and Blochmannia.
In both the local and global reconstructions (Fig S2 and Figure 5.2, respectively), only the ancestor of the Bordetella species has a combination of the paaABC complex with paaE. It appears that this combination has full activity (Grishin et al., 2011). In addition, the global approach predicts gene blocks just for the ancestors of α and most of γ-proteobacteria. Only the common ancestor of the Bordetella genus contains the cluster paaABCE. It has been confirmed that this cluster of genes is identical to those of E. coli (Luengo et al., 2001). In both approaches, genes paaF and paaG are not found to be in the same gene blocks. Hence, the ancestors are most likely missing the hydratase-isomerase complex. The paaJ thiolase catalyzes two steps in the phenylacetate catabolism (Ismail et al., 2003; Teufel et al., 2010; Nogales et al., 2007). In addition, paaH is the NAD+-dependent 3-hydroxyadipyl-CoA dehydrogenase involved in phenylacetate catabolism (Ismail et al., 2003). Therefore, it makes sense that paaJ and paaH appear in most of the ancestral nodes that have gene blocks.
[Figure 5.2 tree; clade labels: ε-proteobacteria, α-proteobacteria, β-proteobacteria, γ-proteobacteria.]
Figure 5.2: Ancestral gene block reconstruction of operon paaABCDEFGHIJK using the global reconstruction approach.
It is interesting to note that we see the formation of intermediate functional forms both in a highly conserved gene block atpIBEFHAGDC and the less conserved gene block based on the operon paaABCDEFGHIJK. Also, in both cases, the global approach performs better in terms of minimizing events. For brevity, we only provide the global ancestral reconstruction henceforth.
5.3.2 Operons using Bacillus subtilis as a reference taxon
B. subtilis is a Gram-positive, spore-forming bacterium commonly found in soil, and is also a normal gut commensal in humans. It is a model organism for Gram-positive, spore-forming bacteria, and as such its genome of about 4,450 genes is well annotated. Here we used ROAGUE to reconstruct the ancestors of two B. subtilis operons across 33 species. We selected species from the order Bacillales using PDA. Species from the following families were selected: Bacillaceae (including the reference organism B. subtilis), Staphylococcaceae (Macrococcus and Staphylococcus), Alicyclobacillaceae, Listeriaceae, and Planococcaceae.
lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO. Gene block lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO facilitates the heat shock response in B. subtilis, and the gene block hrcA-grpE-dnaK-dnaJ was the first identified heat shock operon within Bacillus spp. (Wetzstein et al., 1992). The four genes hrcA, grpE, dnaK, dnaJ (e, c, b, a in Figure 5.3) form a tetracistronic structure, which is essential to the heat shock response role (Homuth et al., 1997). The four genes are proximal (they were never separated in the course of evolution) in all the species examined, and form the core of the orthoblock. Overall, this operon is quite conserved, and the ancestral reconstructions are highly similar to the reference operon.
[Figures 5.3 and 5.4 trees; clade labels: Macrococcus, Paenibacillaceae, Staphylococcus, Alicyclobacillaceae, Bacillaceae, Bacillales Family XII, Listeriaceae, Planococcaceae.]
Figure 5.3: Ancestral reconstruction of lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO.
Figure 5.4: Ancestral reconstruction of mmgABCDE-prpB.
mmgABCDE-prpB. The operon mmgABCDE-prpB is expressed during endosporulation (Acharya, 2009). Subunit mmgABC’s breakdown of fatty acids is a means for attaining energy to drive the cell’s preparation for dormancy (Quattlebaum, 2009). Hence, it is reasonable to see that the common ancestor has this subunit. In addition, gene mmgD and gene prpB/yqiQ are predicted to be proximal. Several studies predicted that genes mmgD, prpB, and prpD encode the proteins of the putative methylcitrate shunt (Voigt et al., 2007). However, they did not specify if deletion mutations might contribute to a defect of the functionality. See Figure 5.4.
5.4 Conclusions
Operons offer a tractable model for the evolution of complexity. Understanding how simple units of genes may converge into an operon can lead us to a better understanding of how complex molecular systems evolve. Here we developed a method to reconstruct ancestral gene blocks using maximum parsimony. Using this method, we provide several examples of ancestral gene block reconstructions based on reference operons in E. coli and B. subtilis. Some interesting observations emerge regarding conservation and the ancestry of operons. From our examples, it appears that essentiality (the trait of being essential to life) and the formation of a protein complex are two drivers for gene block conservation. This is most apparent in the atpIBEFHAGDC operon coding for F1Fo-ATPase in proteobacteria. There are few evolutionary events identified in the atpIBEFHAGDC operon ancestry. In the Supplementary Materials, we provide a brief study of two more gene blocks. We observed that the ribose transporter block also seems to preserve the core ribose transporter (rbsABC), but not the ribose phosphorylation genes rbsD and rbsK. ROAGUE also highlights intermediate functional forms of the orthoblocks, as we see in the pattern of conservation in paaABCDEFGHIJK.
It did not escape our notice that our model does not account for horizontal gene transfer (HGT), which has been shown to be a driver of operon dispersal in some cases (Omelchenko et al., 2003; Koonin, 2009). However, our model does set the stage for a new method for doing so. Typically, detecting horizontal gene transfer is done by looking for the conservation of genes and gene structures between distant OTUs, and for anomalous codon usage (Koonin et al., 2001). Our method opens up a new way of HGT detection, by reconciling a species tree with an operon tree, in the same way that phylogenomic analyses do for gene trees and species trees (Eisen, 1998), which would be an exciting future development of this study. In addition, we ignore the gene order in the gene block. While the relationship between gene organization and expression in operons is not well understood, it is clear from several studies that gene order has an effect in some cases on the expression and functionality of the operon in general (e.g. (Hiroe et al., 2012; Wells et al., 2016; Lim et al., 2011)). Adding the parameters of horizontal gene transfer, gene order preservation, or both to ROAGUE would be highly valuable. We discuss potential extensions of our research in Chapter 7.
CHAPTER 6. UNRAVELING A TANGLED SKEIN: EVOLUTIONARY ANALYSIS OF THE BACTERIAL GIBBERELLIN BIOSYNTHETIC OPERON
Gibberellin (GA) phytohormones are ubiquitous regulators of growth and developmental processes in vascular plants. The convergent evolution of GA production by plant-associated bacteria, including both symbiotic, nitrogen-fixing rhizobia, and phytopathogens, suggests that manipulation of GA signaling is a powerful mechanism for microbes to gain advantage in these interactions. Although homologous operons encode for GA biosynthetic enzymes in both rhizobia and phytopathogens, notable genetic heterogeneity and scattered operon distribution in these lineages suggest distinct functions for GA in varied plant-microbe interactions. Thus, deciphering GA operon evolutionary history could provide crucial evidence for understanding the distinct biological roles for bacterial GA production. To further establish the genetic composition of the GA operon, two operon-associated genes that exhibit limited distribution among rhizobia were biochemically characterized, verifying their roles in GA biosynthesis. In this chapter, we employ the previous maximum-parsimony ancestral gene block reconstruction algorithm to characterize loss, gain, and horizontal gene transfer (HGT) of GA operon genes within alphaproteobacteria rhizobia, which exhibit the most heterogeneity among GA operon-containing bacteria. Collectively, this evolutionary analysis reveals a complex history for HGT of both individual genes and the entire GA operon, and ultimately provides a basis for linking genetic content to bacterial GA functions in diverse plant-microbe interactions.
Within this chapter, the abbreviated term “α-rhizobia” is used to refer to symbiotic, nitrogen-fixing rhizobia that belong to the alphaproteobacteria class. Likewise, the term “β-rhizobia” refers to symbiotic, nitrogen-fixing rhizobia from the betaproteobacteria class. For the α-rhizobia, the following genera were assessed because they have been found to contain the GA operon: Azorhizobium, Bradyrhizobium, Ensifer/Sinorhizobium, Mesorhizobium, Microvirga, and Rhizobium (Nagel and Peters, 2017). The β-rhizobia previously found to have the GA operon fall within the Burkholderia and Paraburkholderia genera (Nagel et al., 2018). The different GA biosynthesis operons, or GA operons, within the α-rhizobia can be found in Figure 6.1.
Figure 6.1: Diversity among GA biosynthetic operons in divergent bacterial lineages. The core operon genes are defined as cyp112, cyp114, fd, sdr, cyp117, ggps, cps, and ks, as these are almost always present within the GA operon. Other genes, including cyp115, idi, and ids2, exhibit a more limited distribution among GA operon-containing species. The tandem diagonal lines in the Mesorhizobium loti MAFF303099 operon indicate that cyp115 is not located adjacent to the rest of the operon.
6.1 Gene clusters analysis
While the GA operons found in gammaproteobacteria exhibit essentially uniform gene content and structural composition, those from the α-rhizobia exhibit much more diversity in genetic struc- 52
Input BLAST and Find homologous genes Trees Output
Species
Ancestral reconstruction of Retrieve and Gibberelin operon using annotated assembly Phylogenetic Tree files from NCBI Species Tree Species Tree Build a species tree (120 species) (64 species)
Reduce Tree using Reconstruct ancestral Gene Bank files PDA gene blocks using ROAGUE
Find homologous Concatenate multiple Operon Tree Operon Tree gene blocks to alignments and build (120 species) (64 species) Ancestral reconstruction of Gibberellin operon an operon tree Gibberelin operon using Genes in Operon Tree Gibberellin operon from Xanthomonas oryzae
Figure 6.2: Pipeline of the method to reconstruct the ancestral state of gibberellin operon ture. This suggests that selective pressures specific to the rhizobia, presumably their symbiotic relationship with legumes, may have driven this heterogeneity in the operon. To better understand the evolutionary history of the GA operon in the α-rhizobia, a more thorough analysis was carried out with ROAGUE (chapter5. ROAGUE generates a phylogenetic tree with selected taxa that contain gene blocks (i.e. gene clusters) of interest, and then uses a maximum parsimony approach to reconstruct a predicted gene block structure at each ancestral node of the tree. Using the
ROAGUE method, the evolutionary events involved in the genetic construction of orthologous GA operon gene blocks in the α-rhizobia, specifically gene loss, gain, and duplication, were quantitatively assessed (see Figure 6.2 for a summary of the method pipeline). A total of 118 α-rhizobia with the GA operon were included in this analysis (see Table 6.2 for a list of the strains). The GA operon most phylogenetically distant from those in the α-rhizobia (lowest sequence similarity of the housekeeping gene rpoB) is found within E. tracheiphila (Nagel and Peters, 2017), and as such, this was used as an outgroup. Additionally, to observe the relative relationship between alpha- and gammaproteobacterial operons, the GA operon from Xanthomonas oryzae pv. oryzicola BLS256 (Xoc) was also included in the analysis.
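As a rough illustration of the maximum parsimony idea mentioned above, the following sketch applies Fitch's small-parsimony rule to the presence or absence of a single gene on a toy tree. This is not ROAGUE's implementation, which reconstructs whole gene blocks; the tree shape, leaf names, and leaf states are hypothetical and chosen only for illustration.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree: nested 2-tuples of leaf names, e.g. (("A", "B"), "C")
    leaf_states: dict mapping leaf name -> 0 (gene absent) or 1 (gene present)
    Returns (set of candidate states at the root, minimum number of changes).
    """
    if isinstance(tree, str):
        # A leaf: its state is observed, no changes needed.
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change.
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical toy example: one gene present only in the outgroup.
toy_tree = ((("strain1", "strain2"), "strain3"), "outgroup")
toy_states = {"strain1": 0, "strain2": 0, "strain3": 0, "outgroup": 1}
root_states, changes = fitch(toy_tree, toy_states)
```

In this toy case a single change explains the data, and the root state is ambiguous between presence and absence; resolving such ambiguity requires a second, top-down pass that is omitted here.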
An initial reconstruction was made by creating a species tree using the amino acid sequence of rpoB (RNA polymerase β subunit) from each strain as the phylogenetic marker gene (“full species tree” or FS) (Figure 6.6). However, based upon the predicted HGT of the operon, it was reasoned that a species tree may not accurately depict GA operon evolution, as operon genes from distantly related bacteria may be more similar than the relevant rpoB genes if the operon was acquired via HGT. Therefore, a second tree was created based on a concatenation of protein sequences comprising the core GA operon (“full operon tree” or FO) (Figure 6.7). Due to the large number of species being analyzed, along with apparent phylogenetic redundancy that could introduce bias, reconstructions were also made with only the more distinct representative strains by using the Phylogenetic Diversity Analyzer (PDA) program, which reduced the number of analyzed taxa to 64 (Chernomor et al., 2015). These reduced phylogenetic trees are referred to as the “partial species tree” or PS (Figure 6.3), and the “partial operon tree” or PO (Figure 6.4).
In Figure 6.3 and Figure 6.4, the lower-case letters in each tree node represent the genes in the orthoblock (e.g. “a” represents cyp115), with each gene additionally indicated by a unique color (see the legend at the top of each figure). A blank space between genes designates a split of ≥500 bp between the genes on either side of the blank space. The green bar at the top left of each figure displays the total number of events that occur in the reconstruction. For each inner node u, the floating-point number (e.g. 98.0) represents the bootstrap support value at that node. The numbers in brackets indicate the cumulative count of events going from the leaf nodes to node u, in the following order: [deletions, duplications, splits]. Each leaf node is accompanied by symbols (*, ?, !), the genomic accession number, the species/strain name, and the gene block for that strain. An asterisk (*) indicates that the gene block contains full-length cyp115 (gene “a”); an exclamation mark (!) indicates that the gene block contains a truncation/fragment of cyp115; and a question mark (?) indicates that the gene block contains ggps2 (gene “k”). The reference strain, Xanthomonas oryzae pv. oryzicola BLS256, is in blue, and the outgroup strain, Erwinia tracheiphila PSU-1, is in gray. These naming and color conventions persist throughout this study.
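The gene block notation can be made concrete with a short sketch. It assumes, hypothetically, that each gene is given as a (letter, start, end) tuple on a single replicon and that a block is written as a compact string in which a blank marks a gap of at least 500 bp; the function names and coordinates below are illustrative and are not taken from ROAGUE.

```python
def render_block(genes, split_gap=500):
    """Render a gene block as a string of gene letters.

    genes: list of (letter, start, end) tuples on one replicon (assumed format).
    A blank is inserted wherever consecutive genes are >= split_gap bp apart.
    """
    ordered = sorted(genes, key=lambda g: g[1])  # order genes by start coordinate
    pieces = []
    prev_end = None
    for letter, start, end in ordered:
        if prev_end is not None and start - prev_end >= split_gap:
            pieces.append(" ")  # split event: intergenic gap of >= split_gap bp
        pieces.append(letter)
        prev_end = end
    return "".join(pieces)


def count_splits(block):
    """Number of splits in a rendered block (one per blank)."""
    return block.count(" ")


# Hypothetical coordinates: c and d are separated by 600 bp.
example = [("b", 0, 1000), ("c", 1100, 2000), ("d", 2600, 3000)]
block = render_block(example)
```

Here `block` is the string "bc d", which contains one split.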
Figure 6.3: Reduced ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method as a measure of the distance between species. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1. [Tree image not reproduced. Event totals: 60 deletions, 0 duplications, 18 splits. Gene legend: a: cyp115, b: cyp112, c: cyp114, d: fd, e: sdr, f: cyp117, g: ggps, h: cps, i: ks, j: idi, k: ggps2, p: cyp115 pseudogene/fragment. Scale bar: 0.09.]
Figure 6.4: Reduced ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1. [Tree image not reproduced. Event totals: 48 deletions, 0 duplications, 14 splits. Gene legend as in Figure 6.3. Scale bar: 0.11.]
The ability of different ancestral reconstructions to capture the likely evolution of a gene cluster can be assessed by the number of events (loss, gain, and duplication) calculated by this method, with a lower number of events indicating a more parsimonious reconstruction. From this analysis, it was found that fewer evolutionary events are reconstructed in FO (75 events) than in FS (121 events) (Figures 6.6 and 6.7), with the same relative trend observed with the partial trees (62 events for PO vs. 78 events for PS) (Figures 6.3 and 6.4). The greater parsimony observed in reconstructions built with alignments of the concatenated GA operon strongly supports the previously suggested hypothesis of HGT among α-rhizobia (Hershey et al., 2014). Accordingly, the reconstructions based on GA operon similarity (i.e. FO and PO) were used for further analyses of operon inheritance.
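The comparison of reconstructions reduces to simple arithmetic on the event counts. The sketch below uses the per-event counts shown in the legends of Figures 6.3 and 6.4 for the partial trees and the totals quoted in the text for the full trees; the variable and function names are illustrative only.

```python
# Event counts for the partial trees, as shown in the legends of Figures 6.3 and 6.4.
events = {
    "PS": {"deletions": 60, "duplications": 0, "splits": 18},
    "PO": {"deletions": 48, "duplications": 0, "splits": 14},
}

# Total events per reconstruction; only totals are quoted in the text for FS and FO.
totals = {name: sum(counts.values()) for name, counts in events.items()}
totals.update({"FS": 121, "FO": 75})


def more_parsimonious(a, b):
    """Return the name of the reconstruction with fewer total events."""
    return a if totals[a] <= totals[b] else b


partial_choice = more_parsimonious("PO", "PS")
full_choice = more_parsimonious("FO", "FS")
```

The sums reproduce the 78 (PS) and 62 (PO) events quoted above, and in both comparisons the operon-based tree is the more parsimonious one.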
In contrast to the phytopathogens, a full-length cyp115 gene is absent from the genomes of most rhizobia (including both α- and β-rhizobia). Instead, α- and β-rhizobia typically have only the core operon and, hence, can only produce the penultimate intermediate GA9 rather than bioactive GA4 (Méndez et al., 2014; Nagel and Peters, 2017; Nett et al., 2017a). ROAGUE analysis indicates that cyp115 loss occurred soon after α-rhizobia acquisition of the GA operon, as the reconstructed ancestral node that connects the α-rhizobia to Xoc (and the rest of the gammaproteobacteria) does not contain cyp115 (Figure 6.4). Although the α-rhizobia presumably acquired their GA operon from a gammaproteobacterial ancestor, the gammaproteobacteria seem to always have cyp115 at the 5’ end of the operon. In contrast, the α-rhizobia typically have only a partial cyp115 pseudogene/fragment located at this position, as previously described (Tully et al., 1998; Nett et al., 2017a). This suggests that the original operon acquired by an α-rhizobial ancestor contained cyp115, and that this gene was subsequently lost. Interestingly, it has been previously reported that cyp115 is also absent from the GA operons of β-rhizobia, which seem to have independently gained their operon from a gammaproteobacterial progenitor (Nagel et al., 2018).
Although cyp115 is absent in most α-rhizobia, a subset of α-rhizobia (<20%) with the GA operon also has a full-length, functional cyp115. However, in only one strain (Mesorhizobium sp. AA22) does the GA operon have cyp115 in the same location as in gammaproteobacterial GA operons (Nett et al., 2017a). Strikingly, ROAGUE analysis indicates that full-length cyp115 has been regained independently in at least three different lineages, which is apparent in either the PS or PO reconstructions (Figures 6.3 and 6.4). Indeed, other than in Mesorhizobium sp. AA22, these full-length cyp115 genes reside in alternative locations relative to the rest of the GA operon (e.g. at the 3’ end of the operon, or distally located), as previously described (Nett et al., 2017a), which further supports the independent acquisition of cyp115 via an additional HGT event.
Similar to cyp115, a full-length idi gene is generally present in the GA operons of gammaproteobacteria, and is only sporadically present in the GA operons of α-rhizobia. Collectively, 62 of the 118 α-rhizobia strains analyzed here possess this gene, which seems to invariably exhibit analogous positioning, i.e. at the 3’ end of the operon, as found in the gammaproteobacterial GA operons. Our ROAGUE analysis indicates that the ancestral strain with the GA operon likely possessed idi, and that this gene has subsequently been lost in many α-rhizobia strains. However, there are notable differences among losses of this gene within the major α-rhizobia genera. For example, while the presence of idi appears to be almost random within Rhizobium (16/26 strains) and Sinorhizobium/Ensifer (8/14), it is nearly absent from all Bradyrhizobium (2/40), but ubiquitously found in Mesorhizobium (36/36).
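The per-genus counts quoted above can be tabulated with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# (strains with idi, strains analyzed) per genus, as quoted in the text.
idi_counts = {
    "Rhizobium": (16, 26),
    "Sinorhizobium/Ensifer": (8, 14),
    "Bradyrhizobium": (2, 40),
    "Mesorhizobium": (36, 36),
}

# Fraction of strains carrying idi in each genus.
fractions = {genus: present / total for genus, (present, total) in idi_counts.items()}

# Totals across the four major genera.
total_present = sum(present for present, _ in idi_counts.values())
total_strains = sum(total for _, total in idi_counts.values())
```

The four genera sum to 62 strains carrying idi, matching the overall figure quoted above, and together account for 116 of the 118 strains analyzed.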
Figure 6.5: Summary of ancestral reconstruction for the GA biosynthetic operon. As a representation of GA operon evolution, the results from the full reconstruction generated using the concatenated operon (FO) are summarized here. The initial loss of the cyp115 gene is indicated with a red arrow, while the reacquisition of this gene is indicated with a green circle at the ancestral node. The loss of ggps and acquisition of ggps2 is indicated by a blue box at the ancestral node. Brackets around a gene represent a variable presence within that lineage. Double slanted lines indicate genes that are not located within the cluster (i.e. >500 bp away). The family of proteobacterial lineages is indicated to the right of the figure (α and γ labels).
Not surprisingly, ggps2 seems to be invariably associated with operons in which the canonical ggps is inactive (Figures 6.3 and 6.4), and is only found in 13 out of the 118 α-rhizobia analyzed in this study. However, the ancestral reconstructions further indicate that ggps2 is present in two distinct clades in all trees: one composed of closely related Rhizobium strains, and another with two Bradyrhizobium strains. While the Rhizobium strains all have homologous mutations in ggps, with similar positioning of ggps2 (within 500 bp of the 3’ end of the operon), the two Bradyrhizobium strains have distinct ggps mutations, with ggps2 positioned on opposite sides of the operon. There is higher homology between the GGPS2 proteins within each of the two clades than between them (Table 6.1), suggesting that each clade acquired ggps2 independently. However, the lack of synteny between the two Bradyrhizobium strains, as well as distinct lesions in ggps, suggests that each of these strains may have separately acquired ggps2 as well. Given that the GGPS2 proteins are all much more closely related to each other (>83% amino acid sequence identity) than to any other homologs (<45% identity), ggps2 appears to have undergone HGT within the α-rhizobia following the initial acquisition by a GA operon in which ggps was lost/inactivated. Nevertheless, acquisition of ggps2 was clearly followed by vertical transmission of the modified GA operon in the case of the larger and more homologous Rhizobium-containing clade.
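For readers unfamiliar with the identity percentages used for GGPS2, the following minimal sketch computes percent identity from two already-aligned protein sequences. The convention used here (matches divided by columns where neither sequence has a gap) is an assumption, since the text does not state how identity was calculated, and the sequences are toy strings rather than real GGPS2 data.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Only columns where neither sequence has a gap are compared
    (one of several common conventions; assumed here for illustration).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only alignment columns without gaps in either sequence.
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Toy aligned fragments (hypothetical, not real GGPS2 sequences).
identity = percent_identity("MKT-LV", "MKSALV")
```

With these toy strings, four of the five ungapped columns match, giving 80% identity.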
In addition to ancestral gene loss and gain events, there have also been fusions between constituent biosynthetic genes within the GA operon. In some α-rhizobia, the fdGA gene, which is usually a distinct coding sequence, is found in-frame with either the 5’ proximal cyp114 gene or the 3’ proximal sdrGA gene, resulting in either cyp114-fd or fd-sdr fusions, which presumably act as bifunctional polypeptides. As fusion events are not factored into ROAGUE, these were assessed and categorized manually (Tables 6.3 and 6.4). The cyp114-fd fusion is only found in a single clade consisting almost entirely of Rhizobium species, which is most evident in the FO reconstruction (Figure 6.7). By contrast, while the fd-sdr fusion is largely found in a clade consisting mostly of Mesorhizobium species (Figure 6.7), this fusion appears to have independently occurred in another clade as well. This fusion seems to be functional, as the activity of the fused Fd-SDR enzyme in M. loti MAFF303099 has been biochemically verified (Tatsukami and Ueda, 2016). Beyond these multiple observations in α-rhizobia, it should be noted that a functional fd-sdr fusion appears to have independently arisen in the β-rhizobia as well (Nagel et al., 2018), indicating that this is not functionally problematic. While the functionality of the CYP114-Fd fusion has not been demonstrated to date, the cooperative activity of the two encoded enzymes (Nett et al., 2017b) supports their ability to be functionally incorporated into a single polypeptide (Marcotte et al., 1999).
6.2 Results and Discussion
Collectively, our analyses demonstrate a complex history of GA operon function, distribution, and evolution within the proteobacteria (Figure 6.5). The ROAGUE analysis reported here is consistent with the hypothesis that the GA operon has undergone HGT between various plant-associated bacteria, including phytopathogenic gammaproteobacteria and symbiotic, nitrogen-fixing α- and β-rhizobia. As an added layer of complexity, it is generally accepted that the large symbiotic or pathogenic genomic islands or plasmids (i.e. symbiotic or pathogenic modules), which enable the plant-associated lifestyle of these bacteria, are capable of undergoing HGT (Dobrindt et al., 2004). For α-rhizobia strains where sufficient genomic information is available, the GA operon is invariably found within the symbiotic module (Sullivan et al., 2002; Freiberg et al., 1997; González et al., 2003; Göttfert et al., 2001; Kaneko et al., 2002; Uchiumi et al., 2004). Interestingly, HGT of the GA operon independently of the symbiotic module has been previously suggested based on phylogenetic incongruences between genes representative of species (16S rRNA), symbiotic module (nifK), and GA operon (cps) similarity (Hershey et al., 2014). Thus, there appear to be multiple levels of HGT with the GA operon in α-rhizobia: 1) acquisition of the symbiotic module (i.e. symbiotic plasmid or genomic island), either with or without the GA operon, 2) separate acquisition of the GA operon within the symbiotic module, and 3) subsequent acquisition of auxiliary genes, including ggps2 and cyp115.
Although widespread within proteobacteria, the GA operon has thus far only been found in plant-associated species (Levy et al., 2018). While this is not surprising due to the function of GA as a phytohormone, it emphasizes that such manipulation of host plants is an effective mechanism for bacteria to gain selective advantage. Indeed, the ability to produce GA seems to be a powerful method of host manipulation for plant-associated microbes more generally, as certain phytopathogenic fungi also have convergently evolved the ability to produce GA as a virulence factor (Wiemann et al., 2013; Hedden and Sponsel, 2015).
Despite the wide-ranging HGT of the GA operon between disparate classes of proteobacteria, its scattered distribution within each of these classes strongly indicates that the ability to produce
GA only provides a selective advantage under certain conditions. This is evident for both symbiotic rhizobia and bacterial phytopathogens. For example, Xoc, the causal agent of bacterial leaf streak in rice, produces GA as a virulence factor to suppress the plant jasmonic acid (JA) induced defense response (Lu et al., 2015; Navarro et al., 2008; Hou et al., 2010). By contrast, the production of
GA by the α-rhizobia M. loti MAFF303099 limits the formation of additional nodules, apparently without having a negative impact on plant growth 4.
The occurrence of GA operon fragments (i.e. presence of some, but not all necessary biosynthetic operon genes) in many rhizobia would seem to indicate that the production of GA is not advantageous in all rhizobia-legume symbioses. For example, at the onset of this study, we identified >160 α-rhizobia with an apparent homolog of at least one GA operon gene, yet only 120 of these contained a gene cluster (i.e. two or more biosynthetic genes clustered together), and many of these clusters are clearly non-functional due to the absence of crucial biosynthetic genes. It has been suggested that the GA operon is associated with species that inhabit determinate nodules (Hershey et al., 2014), as these nodules grow via cell expansion, an activity commonly associated with GA signaling (Schwechheimer, 2012), and not indeterminate nodules, which grow via continuous cell division (Oldroyd et al., 2011). However, while the presence of the GA operon does seem to be somewhat enriched within rhizobia that associate with determinate nodule-forming legumes (Table 6.2), there are many examples of rhizobia with complete GA operons that were isolated from indeterminate nodules. For example, while most GA operon-containing Bradyrhizobium species associate with determinate nodule-forming plants, many species from the Ensifer/Sinorhizobium, Mesorhizobium, and Rhizobium genera with the operon were isolated from indeterminate nodules, as were all three of the β-rhizobia with the GA operon (Integrated Microbe Genomes, JGI). Additionally, many rhizobia, including some with the GA operon such as S. fredii NGR234 (Pueppke and Broughton, 1999), are capable of symbiosis with either type of legume, i.e., those forming either determinate or indeterminate nodules. These inconsistencies raise the question of why only some rhizobia have acquired and maintained the GA operon, and thus the capacity to produce GA.
In addition to its scattered distribution, the operon exhibits notable genetic diversity within the α-rhizobia. For example, ROAGUE analysis indicates that loss of the usual ggps and subsequent recruitment of ggps2 has been followed by HGT of this gene to other operons in which ggps has been inactivated. While it is difficult to infer the original source of ggps2, it is of note that the closest non-GA operon homologs are found clustered together with genes related to photosynthesis (e.g. in the alphaproteobacterium Hyphomicrobium sp. ghe19). Though this specific gene has not been functionally characterized, it seems reasonable that these homologs may be involved in producing GGPP as a precursor of photosynthetic pigments, i.e. phytol and/or carotenoids (Zi et al., 2014).
While the loss/gain of GGPP-synthase genes represents rather dramatic events in GA operon evolution, other modifications to the operon may have subtle yet informative effects upon GA production. In particular, although the loss of idi (and potentially the observed gene fusion events) may reduce the rate of GA production, it appears that this can be easily accommodated in certain rhizobia-legume pairings. Indeed, the expression of the GA operon is delayed in this symbiotic relationship (Méndez et al., 2014), perhaps to mitigate any deleterious effects of GA during early nodule formation, as GA has been shown to be inhibitory to nodule formation, at least at higher concentrations (Hayashi et al., 2014).
Perhaps the most striking evolutionary aspect of the rhizobial GA operons is the early loss and scattered re-acquisition of cyp115 in α-rhizobia. While almost all operon-containing α-rhizobia contain remnants of cyp115 at the position in the GA operon where it is found in gammaproteobacterial phytopathogens (Nett et al., 2017a), there is only one strain (Mesorhizobium sp. AA22) where a full-length copy is found in this location. Phylogenetic analysis further suggests that cyp115 from Mesorhizobium sp. AA22 is closest to the ancestor of all the full-length copies found in α-rhizobia, which are all found at varied locations relative to the GA operon (Nett et al., 2017a). The ROAGUE analysis reported here indicates that cyp115 was lost shortly after the acquisition of the ancestral GA operon by α-rhizobia, despite full-length copies being present in several different lineages. Accordingly, these results support the hypothesis that cyp115 has been re-acquired by this subset of rhizobia via independent HGT events. Notably, while not recognized in the original report (Tatsukami and Ueda, 2016), this includes M. loti MAFF303099, the only strain in which the biological role of rhizobial production of GA has been examined. Because cyp115 is required for endogenous production of bioactive GA4 from the penultimate (inactive) precursor GA9, this provides additional intrigue regarding the selective pressures driving the evolution of GA biosynthesis in rhizobia.
The contrast between GA operon-containing bacterial lineages provides a captivating rationale for the further scattered distribution of cyp115 in rhizobia. In particular, the phytopathogens all contain cyp115 and are thus capable of direct production of bioactive GA4, which serves to suppress the JA-induced plant defense response (Lu et al., 2015). Extension of this observation leads to the hypothesis that rhizobial production of GA4 might negatively impact the ability of the host plant to defend against microbial pathogens invading the roots or root nodules, which would compromise the efficacy of this symbiotic interaction. As such, the detrimental effect of rhizobial production of bioactive GA4 may have driven loss of cyp115. However, this would also result in a loss of GA signaling, as GA9, the product of an operon missing cyp115, presumably does not exert hormonal activity (Ueguchi-Tanaka et al., 2005). One possible mechanism to compensate for cyp115 loss would be legume host expression of the functionally-equivalent plant GA 3-oxidase (GA3ox) gene
(from endogenous plant GA metabolism) within the nodules in which the rhizobia reside. The expression of this plant gene could alleviate the necessity for rhizobial symbiont maintenance of cyp115, and would further allow the host to control the production of bioactive GA4, and thereby mount an effective defense response when necessary. Re-acquisition of cyp115 could then be driven by a lack of such GA3ox expression in nodules by certain legumes. However, this scenario remains hypothetical: although precise GA production by the plant has been shown to be critical for normal nodulation to occur (Hayashi et al., 2014; McAdam et al., 2018), coordinated biosynthesis of GA4 by rhizobia and the legume host would need to be demonstrated. This includes both the transport of GA9 from microbe to host plant and the subsequent conversion of this precursor to a bioactive GA (e.g. GA4). Accordingly, continued study of the GA operon will provide insight into the various roles played by bacterially-produced GA in both symbiotic rhizobia-legume relationships and antagonistic plant-pathogen interactions, which in turn can be expected to provide fundamental knowledge regarding the ever-expanding roles of GA signaling in plants.
General methodology and source code
A summary of the methods used herein can be found in Figure 6.2. All code and scripts used for analysis within this manuscript, as well as a general workflow for the use of the ROAGUE method in the ancestral reconstruction of gene blocks, can be found at the following GitHub repositories:
https://github.com/nguyenngochuy91/Gibberellin-Operon
https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction
Identifying orthologous gene blocks
Using Xanthomonas oryzae pv. oryzicola BLS256 (Xoc) as a reference taxon, we retrieved the 10 genes in the GA operon (cyp115, cyp112, cyp114, fdGA, sdrGA, cyp117, ggps, cps, ks, and idi), for which all genes or their orthologs (except idi) have been previously experimentally validated (Lu et al., 2015; Hershey et al., 2014; Nagel et al., 2017). From those 10 genes, we determined whether a query strain contains orthologous gene blocks via BLAST searches against the NCBI non-redundant nucleotide database. Sequences were confirmed as orthologs if the BLAST e-value was 1e-10 or less. Initial BLAST analysis (April 12, 2017) revealed 166 bacterial strains within the alphaproteobacteria class (specifically: Azorhizobium, Bradyrhizobium, Mesorhizobium, Microvirga, Rhizobium, and Ensifer/Sinorhizobium species) that contained orthologs of one or more of the GA operon genes. Given this set of 166 species/strain names, the corresponding genome assembly files were retrieved from the NCBI website. Using the NCBI assembly_summary.txt file, the strains' genomic .fna (FASTA nucleic acid) files were downloaded. The number of strains analyzed was further reduced by only including strains with multiple GA operon genes (>2) clustered together, resulting in a final total of 118 strains (listed in Supplemental Table 3). Retrieved genome assemblies for these strains were then annotated using Prokka (Seemann, 2014).
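The e-value filter and the gene-count filter described above can be illustrated with a minimal Python sketch. It is not taken from the repositories listed earlier. It assumes that BLAST results are available in the standard 12-column tabular format (-outfmt 6), that the query identifiers are the reference gene names, and that each subject accession stands for one strain; the function names are hypothetical. Whether the matched genes are clustered together on the chromosome is not checked here.

```python
from typing import Dict, Iterable, List, NamedTuple, Set

E_VALUE_CUTOFF = 1e-10  # ortholog threshold stated in the text


class Hit(NamedTuple):
    query: str     # reference gene name, e.g. "cyp112" (assumed query ID)
    subject: str   # subject sequence accession (assumed: one per strain)
    evalue: float  # BLAST e-value


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse BLAST tabular lines; column 11 of the default layout is the e-value."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[10])))
    return hits


def genes_per_subject(hits: Iterable[Hit],
                      cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Set[str]]:
    """Map each subject to the reference genes with an e-value of cutoff or less."""
    found: Dict[str, Set[str]] = {}
    for hit in hits:
        if hit.evalue <= cutoff:
            found.setdefault(hit.subject, set()).add(hit.query)
    return found


def subjects_with_multiple_genes(found: Dict[str, Set[str]],
                                 min_genes: int = 3) -> List[str]:
    """Keep subjects with more than two distinct operon genes (clustering not checked)."""
    return sorted(s for s, genes in found.items() if len(genes) >= min_genes)
```

A fuller version would also use the subject coordinates (columns 9 and 10 of the same format) to confirm that the retained genes lie close together before a strain is kept.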
Identification of pseudo-cyp115 sequences
Though cyp115 is found as a full-length gene in the majority of gammaproteobacterial GA operons, as well as in some alphaproteobacterial operons, it exists as a truncated open reading frame, or gene fragment, in most alphaproteobacteria. Previous assessment of cyp115 gene fragments was performed by manual inspection of the genomic sequence 5' to cyp112 in the GA operon (Nett et al., 2017a). To identify these gene fragments (pseudo-cyp115, or p-cyp115) in a more streamlined process, BLAST searches were performed with the Xoc cyp115, as described above, to identify sequences with a BLAST e-value less than 1e-10. If the identified sequence was less than 60% of the length of the query gene and shared greater than 50% sequence identity with it, then the sequence was annotated as p-cyp115.
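The annotation rule for p-cyp115 amounts to three threshold checks, sketched below in Python. The function name and argument names are hypothetical, and the sketch assumes that the hit length and the query length are measured in the same units and that identity is given as a percentage.

```python
def is_pseudo_cyp115(hit_length: int,
                     query_length: int,
                     percent_identity: float,
                     evalue: float,
                     max_evalue: float = 1e-10,
                     max_length_fraction: float = 0.60,
                     min_identity: float = 50.0) -> bool:
    """Return True if a BLAST hit to Xoc cyp115 meets the p-cyp115 criteria.

    Criteria from the text: e-value below 1e-10, hit shorter than 60% of the
    query gene, and sequence identity greater than 50%.
    """
    if evalue >= max_evalue:
        return False  # not a significant match to cyp115
    is_short = hit_length < max_length_fraction * query_length
    is_similar = percent_identity > min_identity
    return is_short and is_similar
```

For example, a 400-residue hit against a 1000-residue query at 75% identity and an e-value of 1e-30 would be annotated as p-cyp115, whereas a 900-residue hit with the same identity would not.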
Computational Reconstruction of the Gibberellin Operon Phylogeny
ROAGUE software was used to reconstruct ancestral gene blocks. Previous phylogenetic analysis demonstrated incongruences between phylogenetic trees constructed with a species marker (16S rRNA), a symbiotic module marker (nifK), and GA operon genes (Hershey et al., 2014). This suggests independent horizontal gene transfer (HGT) of both the symbiotic module and the GA operon. In an effort to objectively and thoroughly assess the possibility of HGT of the GA operon among alphaproteobacterial rhizobia, phylogenetic trees were constructed using alignments from both a species marker gene (rpoB) and a concatenation of the protein sequences for genes in the GA operon. The species tree (FS, PS) was generated by aligning rpoB (species marker) protein sequences via the MUSCLE algorithm (https://www.ebi.ac.uk/Tools/msa/muscle/), and then using the Neighbor-Joining method to generate the tree. For the operon tree (FO, PO), the protein sequences of open reading frames (ORFs) of the orthoblock genes for each species were naively concatenated. Because many species lack a full-length cyp115 and idi gene, and due to the loss of ggps in several species, only the following genes, which seem to be more uniformly conserved, were concatenated for this purpose: cyp112, cyp114, fdGA, sdrGA, cyp117, cps, and ks. Multiple sequence alignment of these concatenations was performed using the MUSCLE algorithm, and the Neighbor-Joining method was used to build the presented trees. Erwinia tracheiphila PSU-1 was used as the phylogenetic outgroup in both trees, as this species is phylogenetically distant and has been shown to have the most distant GA operon to that of the alphaproteobacteria (Nagel and Peters, 2017). ROAGUE analysis (Nguyen et al., 2019) was then applied to each phylogenetic reconstruction. Each leaf node v in FS and PS contains orthologs to the genes found in the GA operon of the reference species (Xoc). For any two genes a and b, if the chromosomal distance is less than 500 bp, the genes are written as ab. If the distance is greater than 500 bp, they are written with the separator character, i.e., as a|b.
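The leaf encoding based on the 500 bp distance rule can be sketched as follows. This is a simplified Python illustration rather than the ROAGUE implementation: it assumes all genes lie on one contig, takes the separator character to be "|", and treats a gap of exactly 500 bp (a case the text leaves open) as adjacent. The function name and the input layout are hypothetical.

```python
from typing import List, Tuple

MAX_GAP_BP = 500   # distance threshold stated in the text
SEPARATOR = "|"    # assumed separator character

Gene = Tuple[str, int, int]  # (single-letter gene code, start, end)


def encode_gene_block(genes: List[Gene], max_gap: int = MAX_GAP_BP) -> str:
    """Write genes in chromosomal order, inserting a separator at large gaps."""
    ordered = sorted(genes, key=lambda g: g[1])  # sort by start coordinate
    parts: List[str] = []
    prev_end = None
    for letter, start, end in ordered:
        # The gap is measured from the furthest end seen so far to this start.
        if prev_end is not None and start - prev_end > max_gap:
            parts.append(SEPARATOR)
        parts.append(letter)
        prev_end = end if prev_end is None else max(prev_end, end)
    return "".join(parts)
```

With the gene letters used in the figure legends (b = cyp112, c = cyp114, d = fd), three genes at positions 100-1300, 1350-2600, and 3500-3800 would be encoded as bc|d, because only the second gap (900 bp) exceeds the threshold.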
Due to the large number (118 strains) and redundancy (both phylogenetically and in operon structure) of the alphaproteobacteria strains, the size of the tree was reduced by using the Phylogenetic Diversity Analyzer software (Chernomor et al., 2015). To facilitate analysis and presentation, the number of species was reduced to 64, which was still representative of the overall diversity and enabled ready visualization. Additionally, this software only keeps species that have a sufficiently unique sequence identity to give distinct branches on the phylogenetic tree (i.e., reflecting appreciable distance between species). Accordingly, this approach also eliminated redundancy that would otherwise have confounded this analysis. The full operon tree, full species tree, partial operon tree, and partial species tree are referred to as the FO, FS, PO, and PS trees, respectively. The topology of each of these reconstructions was then compared in order to identify major incongruences that may indicate HGT.
Identification of gene fusion events
Currently, ROAGUE does not account for gene fusion. Given the presence of several GA operons wherein the fdGA gene is fused in frame with either the sdrGA gene or the cyp114 gene, it was necessary to assess these manually. This was accomplished by rechecking the gene block to determine if the fdGA gene was missing, as the fusion of this gene in frame with either sdrGA or cyp114 would result in it not being recognized as a unique ORF by the ROAGUE software. Then, the BLAST results were queried, and potential fusions were found by checking the start and end of the alignments for the subject genome. Strains with cyp114-fd fusions are listed in Table 6.3, and strains with fd-sdr fusions are listed in Table 6.4.
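One way such a manual check could be organized is sketched below in Python. The text does not spell out the exact criterion, so the rule used here is an assumption: a fusion is flagged when fd is absent from the encoded gene block but the fd BLAST alignment falls entirely inside the annotated ORF of cyp114 or sdr. Reading frame is not verified, and the function names are hypothetical. Gene letters follow the figure legends (c = cyp114, d = fd, e = sdr).

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) on the subject genome, in either order


def contains(outer: Interval, inner: Interval) -> bool:
    """True if the inner interval lies entirely within the outer interval."""
    outer_lo, outer_hi = sorted(outer)
    inner_lo, inner_hi = sorted(inner)  # sorting handles minus-strand hits
    return outer_lo <= inner_lo and inner_hi <= outer_hi


def candidate_fusion(block: str,
                     fd_hit: Optional[Interval],
                     orfs: Dict[str, Interval]) -> Optional[str]:
    """Return 'cyp114-fd', 'fd-sdr', or None for one strain.

    block  : encoded gene block, e.g. "bcefghi"
    fd_hit : subject coordinates of the fd BLAST alignment, if any
    orfs   : annotated ORF coordinates keyed by gene letter
    """
    if "d" in block or fd_hit is None:
        return None  # fd was called as its own ORF, or has no alignment at all
    for partner, label in (("c", "cyp114-fd"), ("e", "fd-sdr")):
        orf = orfs.get(partner)
        if orf is not None and contains(orf, fd_hit):
            return label
    return None
```

Any strain flagged this way would still need the in-frame junction confirmed by inspecting the sequence, as was done manually here.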
[Figure 6.6 image: phylogenetic tree annotated with the reconstructed gene block at each node. Reported totals: deletion count 85, duplication count 0, split count 36. Gene key: a = cyp115, b = cyp112, c = cyp114, d = fd, e = sdr, f = cyp117, g = ggps, h = cps, i = ks, j = idi, k = ggps2, p = cyp115 pseudogene/fragment.]
Figure 6.6: Ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two γ-proteobacteria using the Neighbor-Joining method as a measure of the distance between species. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.
[Figure 6.7 image: phylogenetic tree annotated with the reconstructed gene block at each node. Reported totals: deletion count 55, duplication count 0, split count 20. Gene key: a = cyp115, b = cyp112, c = cyp114, d = fd, e = sdr, f = cyp117, g = ggps, h = cps, i = ks, j = idi, k = ggps2, p = cyp115 pseudogene/fragment.]
Figure 6.7: Ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two β-proteobacteria using the Neighbor-Joining method. ROAGUE was then applied to create the ancestral operon reconstruction. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1 69
6.3 Appendix: Supplementary Materials
Table 6.1: Pairwise alignments of GGPS2 proteins and representative GGPS proteins. Protein sequences were aligned using the MUSCLE algorithm (Geneious Prime; default settings). The GGPS proteins included in this analysis were selected because of the relatively close phylogenetic relationship between the core operon of their strains and that of the ggps2-containing strains, which lack a full-length ggps gene (i.e., Rhizobium etli bv. mimosae str. Mim1 has the most similar core operon to the ggps2-containing Rhizobium strains, and Bradyrhizobium sp. WSM1417 has the most similar core operon to the ggps2-containing Bradyrhizobium strains). Sequences with greater than 90% amino acid identity are highlighted in dark blue, while those with greater than 80% but less than 90% amino acid identity are highlighted in light blue.
Strain | CCGE 510 | HBR26 | CFN 42 | WSM2254 | WSM3983 | Mim1 | WSM1417
Rhizobium sp. CCGE 510 | | | | | | |
Rhizobium sp. HBR26 | 95 | | | | | |
Rhizobium etli CFN 42 | 96 | 97 | | | | |
Bradyrhizobium sp. WSM2254 | 83 | 84 | 84 | | | |
Bradyrhizobium sp. WSM3983 | 83 | 84 | 84 | 94 | | |
Rhizobium etli bv. mimosae str. Mim1 | 28 | 29 | 29 | 30 | 30 | |
Bradyrhizobium sp. WSM1417 | 28 | 29 | 29 | 29 | 29 | 88 |
Table 6.2: List of strains used within the ancestral reconstruction analyses. Shown for each strain is the corresponding GenBank genome accession, the legume host from which it was isolated, and the nodule type (D = determinate, I = indeterminate, * = neither). Note that the legume host and nodule type are irrelevant for the gammaproteobacteria plant pathogens analyzed in these reconstructions.
Genome accession | Strain | Legume host | Type
AXBA01000001.1 | Azorhizobium doebereinerae UFLA1-100 | Sesbania virgata | *
AP014659.1 | Bradyrhizobium diazoefficiens NK6 | Glycine max | D
ADOU02000001.1 | Bradyrhizobium diazoefficiens SEMIA 5080 | Glycine max | D
BA000040.2 | Bradyrhizobium diazoefficiens USDA 110 | Glycine max | D
AXAH01000003.1 | Bradyrhizobium elkanii USDA 3254 | Phaseolus acutifolius | D
AXAW01000001.1 | Bradyrhizobium elkanii USDA 3259 | Phaseolus lunatus | D
KB900701.1 | Bradyrhizobium elkanii USDA 76 | Glycine max | D
AXAU01000001.1 | Bradyrhizobium elkanii WSM1741 | Rhynchosia minima | D
LFIP02000001.1 | Bradyrhizobium embrapense SEMIA 6208 | Desmodium heterocarpon | D
AXBC01000011.1 | Bradyrhizobium genosp. SA-4 str. CB756 | Macrotyloma africanum | D
CP010313.1 | Bradyrhizobium japonicum E109 | Glycine max | D
JGCL01000001.1 | Bradyrhizobium japonicum FN1 | Glycine max (soil isolate) | D
LGUJ01000001.1 | Bradyrhizobium japonicum Is-1 | Glycine max | D
JRPN01000001.1 | Bradyrhizobium japonicum Is-34 | Glycine max | D
CP007569.1 | Bradyrhizobium japonicum SEMIA 5079 | Glycine max | D
AXAX01000001.1 | Bradyrhizobium japonicum USDA 122 | Glycine max | D
AXVP01000001.1 | Bradyrhizobium japonicum USDA 123 | Glycine max | D
KB893805.1 | Bradyrhizobium japonicum USDA 124 | Glycine max | D
AXAT01000001.1 | Bradyrhizobium japonicum USDA 135 | Glycine max | D
AXAG01000003.1 | Bradyrhizobium japonicum USDA 38 | Glycine max | D
AXAF01000003.1 | Bradyrhizobium japonicum USDA 4 | Glycine max | D
AP012206.1 | Bradyrhizobium japonicum USDA 6 | Glycine max | D
LJYE01000001.1 | Bradyrhizobium pachyrhizi BR3262 | Vigna unguiculata | D
LFIQ01000001.1 | Bradyrhizobium pachyrhizi PAC 48 | Pachyrhizus erosus | D
AJQG01000001.1 | Bradyrhizobium sp. CCBAU 15615 | Glycine max | D
AJQH01000001.1 | Bradyrhizobium sp. CCBAU 15635 | Glycine max | D
AJQE01000001.1 | Bradyrhizobium sp. CCBAU 43298 | Glycine max | D
AUFA01000001.1 | Bradyrhizobium sp. Cp5.3 | Centrosema pubescens | D
FMAI01000001.1 | Bradyrhizobium sp. ERR11 | Erythrina brucei | D
AUGA01000001.1 | Bradyrhizobium sp. th.b2 | Amphicarpaea bracteata | D
AXAD01000001.1 | Bradyrhizobium sp. USDA 3384 | Kennedia coccinea | D
JH600072.1 | Bradyrhizobium sp. WSM1253 | Ornithopus compressus | D
KI911783.1 | Bradyrhizobium sp. WSM1417 | Lupinus sp. | I
AXAZ01000001.1 | Bradyrhizobium sp. WSM1743 | Indigofera sp. | I
AXAB01000001.1 | Bradyrhizobium sp. WSM2254 | Acacia dealbata | I
KB902768.1 | Bradyrhizobium sp. WSM2793 | Rhynchosia totta | D
AXAY01000001.1 | Bradyrhizobium sp. WSM3983 | Kennedia coccinea | D
KB890498.1 | Bradyrhizobium sp. WSM4349 | Syrmatium glabrum | D*
CM001442.1 | Bradyrhizobium sp. WSM471 | Ornithopus pinnatus | D
LJYF01000001.1 | Bradyrhizobium yuanmingense BR3267 | Kennedia coccinea | D
AJQL01000001.1 | Bradyrhizobium yuanmingense CCBAU 35157 | Glycine max | D
AJQT01000001.1 | Ensifer sojae CCBAU 5684 | Glycine max | D
AZNX01000001.1 | Ensifer sp. TW10 | Tephrosia purpurea | I
AZUW01000001.1 | Ensifer sp. WSM1721 | Indigofera sp. | I
AHAM01000001.1 | Mesorhizobium alhagi CCNWXJ12-2 | Alhagi sparsifolia | I
AGSN01000001.1 | Mesorhizobium amorphae CCNWGS0123 | Robinia pseudoacacia | I
CP003358.1 | Mesorhizobium australicum WSM2073 | Biserrula pelecinus | I
CP002447.1 | Mesorhizobium ciceri bv. biserrulae WSM1271 | Biserrula pelecinus | I
CP015064.1 | Mesorhizobium ciceri bv. biserrulae WSM1284 | Biserrula pelecinus | I
AXAE01000005.1 | Mesorhizobium erdmanii USDA 3471 | Lotus corniculatus | D
AXAL01000001.1 | Mesorhizobium loti CJ3sym | Lotus corniculatus | D
AP003017.1 | Mesorhizobium loti MAFF303099 | Lotus pedunculatus | D
LYTJ01000001.1 | Mesorhizobium loti NZP2014 | Lotus sp. | D
KB913026.1 | Mesorhizobium loti NZP2037 | Lotus divaricatus | D
LYTK01000001.1 | Mesorhizobium loti NZP2042 | Lotus sp. | D
KI632510.1 | Mesorhizobium loti R7A | Lotus corniculatus | D
KI912159.1 | Mesorhizobium loti R88b | Lotus corniculatus | D
CP002279.1 | Mesorhizobium opportunistum WSM2075 | Biserrula pelecinus | I
LYTO01000001.1 | Mesorhizobium sp. AA22 | Biserrula pelecinus L | I
AYXF01000001.1 | Mesorhizobium sp. L103C105A0 | Acmispon wrangelianus | D
AYWZ01000001.1 | Mesorhizobium sp. L2C066B000 | Acmispon wrangelianus | D
AYWW01000001.1 | Mesorhizobium sp. L2C085B000 | Acmispon wrangelianus | D
AYWU01000001.1 | Mesorhizobium sp. L48C026A00 | Acmispon wrangelianus | D
AYWP01000001.1 | Mesorhizobium sp. LNHC232B00 | Acmispon wrangelianus | D
AYWN01000001.1 | Mesorhizobium sp. LNJC372A00 | Acmispon wrangelianus | D
AYWB01000001.1 | Mesorhizobium sp. LSHC412B00 | Acmispon wrangelianus | D
AYWA01000001.1 | Mesorhizobium sp. LSHC414A00 | Acmispon wrangelianus | D
AYVY01000001.1 | Mesorhizobium sp. LSHC420B00 | Acmispon wrangelianus | D
AYVX01000001.1 | Mesorhizobium sp. LSHC422A00 | Acmispon wrangelianus | D
AYVS01000001.1 | Mesorhizobium sp. LSHC440B00 | Acmispon wrangelianus | D
AYVP01000001.1 | Mesorhizobium sp. LSJC265A00 | Acmispon wrangelianus | D
AYVO01000001.1 | Mesorhizobium sp. LSJC268A00 | Acmispon wrangelianus | D
AYVN01000001.1 | Mesorhizobium sp. LSJC269B00 | Acmispon wrangelianus | D
AYVM01000001.1 | Mesorhizobium sp. LSJC277A00 | Acmispon wrangelianus | D
MDLH01000001.1 | Mesorhizobium sp. SEMIA 3007 | Pisum sativum | I
CAAF010000001.1 | Mesorhizobium sp. STM 4661 | Anthyllis vulneraria | D
AZUV01000001.1 | Mesorhizobium sp. WSM1293 | Lotus sp. | D
AZUX01000001.1 | Mesorhizobium sp. WSM2561 | Lessertia diffusa | I
ATYO01000001.1 | Mesorhizobium sp. WSM3224 | Otholobium candicans | D
AZUY01000001.1 | Mesorhizobium sp. WSM3626 | Lessertia diffusa | I
AZYE01000987.1 | Microvirga lupini Lut6 | Lupinus texensis | I
LJSR01000001.1 | Rhizobium acidisoli FH23 | Phaseolus vulgaris | D
LFIO01000001.1 | Rhizobium ecuadorense CNPSO 671 | Phaseolus vulgaris | D
CP006986.1 | Rhizobium etli bv. mimosae str. IE4771 | Phaseolus vulgaris | D
CP005950.1 | Rhizobium etli bv. mimosae str. Mim1 | Mimosa affinis | I
CP007641.1 | Rhizobium etli bv. phaseoli str. IE4803 | Phaseolus vulgaris | D
CP000133.1 | Rhizobium etli CFN 42 | Phaseolus vulgaris | D
CP001074.1 | Rhizobium etli CIAT 652 | Phaseolus vulgaris | D
ATTO01000001.1 | Rhizobium favelukesii OR191 | Medicago sativa | I
AQHN01000001.1 | Rhizobium freirei PRF 81 | Phaseolus vulgaris | D
CP006877.1 | Rhizobium gallicum bv. gallicum R602 | Phaseolus vulgaris | D
AEYE02000001.1 | Rhizobium grahamii CCGE 502 | Dalea leporin | I
KB905373.1 | Rhizobium leguminosarum bv. phaseoli 4292 | Phaseolus vulgaris | D
JFGP01000001.1 | Rhizobium leguminosarum bv. phaseoli CCGM1 | Phaseolus vulgaris | D
ATTN01000001.1 | Rhizobium leguminosarum bv. phaseoli FA23 | Phaseolus vulgaris | D
AJUI01000023.1 | Rhizobium leguminosarum bv. trifolii CC278f | Trifolium nanum | I
KI911771.1 | Rhizobium leguminosarum bv. trifolii CC283b | Trifolium ambiguum | I
ATYQ01000001.1 | Rhizobium leguminosarum bv. viciae VF39 | Vicia faba | I
AUFB01000001.1 | Rhizobium leucaenae USDA 9039 | Phaseolus vulgaris | D
FMAF01000001.1 | Rhizobium lusitanum P1-7 | Phaseolus vulgaris | D
HF536772.1 | Rhizobium mesoamericanum STM3625 | Mimosa pudica | I
ATYY01000001.1 | Rhizobium mesoamericanum STM6155 | Mimosa pudica | I
ATTQ01000001.1 | Rhizobium mongolense USDA 1844 | Medicago ruthenica | I
AHJU02000001.1 | Rhizobium phaseoli Ch24-10 | Phaseolus vulgaris | D
AEYF01000001.1 | Rhizobium sp. CCGE 510 | Phaseolus albescens | D
FMAJ01000001.1 | Rhizobium sp. HBR26 | Phaseolus vulgaris | D
CP004015.1 | Rhizobium tropici CIAT 899 | Phaseolus vulgaris | D
JFGO01000001.1 | Sinorhizobium americanum CCGM7 | Phaseolus vulgaris | D
ATYB01000007.1 | Sinorhizobium arboris LMG 14919 | Prosopis chilensis | I
AJQN01000001.1 | Sinorhizobium fredii CCBAU 25509 | Glycine max | D
AMCX01000001.1 | Sinorhizobium fredii GR64 | Phaseolus vulgaris | D
HE616890.1 | Sinorhizobium fredii HH103 | Glycine max | D
CP000874.1 | Sinorhizobium fredii NGR234 | Lablab purpureus | D
CP003563.1 | Sinorhizobium fredii USDA 257 | Glycine max | D
AQWP01000001.1 | Sinorhizobium meliloti 4H41 | Phaseolus vulgaris | D
ATZC01000001.1 | Sinorhizobium meliloti GVPV12 | Phaseolus vulgaris | D
ATYC01000007.1 | Sinorhizobium meliloti WSM4191 | Melilotus siculus | I
LATE01000001.1 | Sinorhizobium sp. PC2 | Prosopis cineraria | I
CP003057.2 | Xanthomonas oryzae pv. oryzicola BLS256 | |
KE136308.1 | Erwinia tracheiphila PSU-1 | |
Table 6.3: Strains containing a cyp114-fdGA gene fusion.
Bradyrhizobium sp. ERR11
Bradyrhizobium yuanmingense BR3267
Rhizobium etli CIAT 652
Rhizobium etli CNPAF512
Rhizobium etli bv. phaseoli IE4803
Rhizobium phaseoli Ch24-10
Rhizobium etli CFN 42
Rhizobium leguminosarum bv. phaseoli CCGM1
Rhizobium acidisoli FH23
Rhizobium sp. HBR26
Mesorhizobium sp. WSM3224
Rhizobium etli bv. phaseoli str. IE4803
Rhizobium leguminosarum bv. trifolii CC278f
Rhizobium leguminosarum bv. phaseoli FA23
Rhizobium etli bv. mimosae str. Mim1
Sinorhizobium meliloti WSM4191
Table 6.4: Strains containing a fdGA-sdrGA gene fusion.
Mesorhizobium sp. LSHC414A00
Mesorhizobium loti MAFF303099
Mesorhizobium loti CJ3sym
Mesorhizobium loti R7A
Rhizobium mongolense USDA 1844
Mesorhizobium sp. LSJC269B00
Mesorhizobium sp. LSJC277A00
Mesorhizobium sp. LSJC265A00
Mesorhizobium sp. L2C066B000
Mesorhizobium sp. L2C085B000
Bradyrhizobium sp. WSM1253
Mesorhizobium loti NZP2042
Mesorhizobium sp. LSHC440B00
Mesorhizobium sp. SEMIA 3007
Mesorhizobium sp. L48C026A00
Mesorhizobium sp. LNHC232B00
Bradyrhizobium sp. WSM471
Bradyrhizobium japonicum USDA 135
Mesorhizobium sp. LSJC268A00
Mesorhizobium sp. STM 4661
Mesorhizobium loti NZP2037
Mesorhizobium loti R88b
Mesorhizobium loti NZP2014
Mesorhizobium sp. LSHC420B00
Mesorhizobium sp. LNJC372A00
Mesorhizobium sp. LSHC412B00
Mesorhizobium erdmanii USDA 3471
Mesorhizobium sp. L103C105A0
Mesorhizobium sp. LSHC422A00
CHAPTER 7. DISCUSSION AND FUTURE DIRECTIONS
Evolution is an important force that drives the complexity of biological systems, and gene blocks in bacteria are products of the evolution of such complex systems. Many models have been proposed to explain the evolution of gene blocks in bacteria; however, the event-based model provides a more practical tool for studying evolutionary change in bacterial gene blocks. In this thesis, I presented three studies on the evolution of gene blocks using the event-based model. I hope to draw more attention to this niche of evolutionary study and to promote the use of the event-based model for quantifying evolutionary change in gene blocks. My current methods can be extended in several directions so that the model can tackle more difficult and important problems.
In this chapter, I discuss Chapters 4, 5, and 6. The discussion of each project is organized into three components:
• Problem: the biological problem of interest.
• Approach: the computational methods to tackle the problem.
• Results: the impact of this work on scientific research.
An outline of possible extensions of my methods and some interesting ideas to examine appears in the sections on future directions.
7.1 Finding Orthologous Gene Blocks in Bacteria: the Computational Hardness of the Problem and Novel Methods to Address It
7.1.1 Discussions
• Problem. In Chapter 4, we presented the problem of finding orthologous gene blocks in bacteria using the event-based method, which is an established and useful method for measuring the evolutionary changes of gene blocks in bacteria (Ream et al., 2015). Although the problem had been studied previously, there had been no formal study of its complexity. The previous work introduced a heuristic approach that came with neither a running-time bound nor an optimality guarantee for its results.
• Approach. To examine the complexity of the problems, we formalized them in Chapter 3. Under these formalizations, we proved that the problem is intrinsically hard (NP-Hard), which means that no polynomial-time algorithm is likely to solve it exactly. Furthermore, we showed that the problem is hard to approximate (APX-Hard) within a factor of ln n of the optimal result. We then presented our greedy approach and proved that it is a ln n approximation algorithm. To compare our approach with the previous approach, we formulated our problem as an Integer Linear Program and solved it exactly using the CPLEX solver. The results showed that our method was faster and more accurate than the previous approach.
• Results. The article describes the first complexity study of finding orthologous gene blocks in bacteria. It provided insight into how hard the problem is, which steered us away from trying to solve it exactly and toward other alternatives. By exploring alternative approaches such as approximation, we arrived at a polynomial-time algorithm with a provable bound on the quality of its results. This work showed that even a simplified version of a biological problem can be extremely hard to solve. However, one should not be discouraged, since there are approaches that solve the problem exactly on small data sets and approximate it on large data sets.
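As a point of reference for the ln n guarantee discussed above, the following minimal sketch shows the classical greedy heuristic for set cover (Chvatal, 1979). It is not the greedy algorithm of Chapter 4; the function name `greedy_set_cover` and the toy gene sets are hypothetical and serve only to illustrate the greedy selection rule.

```python
def greedy_set_cover(universe, subsets):
    """Classical greedy set cover: return the indices of the chosen subsets.

    universe: iterable of elements to cover.
    subsets:  list of Python sets.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset that covers the most still-uncovered elements
        # (ties are broken by the lowest index).
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        gained = uncovered & subsets[best]
        if not gained:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy example (hypothetical data): genes a..g and candidate groups of genes.
genes = set("abcdefg")
candidate_groups = [set("abc"), set("cde"), set("efg"), set("adg"), set("abcdef")]
cover = greedy_set_cover(genes, candidate_groups)
print(cover)
```

The greedy rule never looks ahead, which is why it runs in polynomial time while an exact solver may not.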
7.1.2 Future directions
In the article, we tackled a relaxed version of the original problem, which was already hard.
There are several approaches to such problems: approximation algorithms, fixed-parameter tractable algorithms, randomized algorithms, etc. Fixed-parameter algorithms identify a specific parameter of an intractable problem and solve the problem efficiently as long as that parameter is small. However, since set cover reduces to our problem and set cover is W[2]-hard (Downey and Fellows, 2012), our problem is probably not fixed-parameter tractable.
A proof of such hardness can keep us from wasting effort on trying to fix a parameter in our problem. On the other hand, randomized algorithms can be a promising approach. There are several well-known randomized algorithms, such as the Quicksort algorithm (Hoare, 1961) and Karger's algorithm (Karger, 1999). Interestingly, our cost function in Problem 3 is a non-monotonic modular function. That is, for any two sets A and B that are subsets of our gene set S, f(S, A) + f(S, B) = f(S, A ∪ B) + f(S, A ∩ B). Hence, we can use either the non-adaptive randomized algorithm or the adaptive randomized algorithm to provide a constant approximation for our problem (Feige et al., 2011). In future work, we can compare those approaches with the algorithms presented in Chapter 2.
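The modularity identity can be checked by brute force on a small ground set. The sketch below is illustrative only: it fixes the gene set S and writes f(A) for f(S, A), and the per-gene weights and both cost functions are hypothetical stand-ins rather than the cost function of Problem 3.

```python
from itertools import combinations


def powerset(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_modular(f, ground_set, tol=1e-9):
    """Check f(A) + f(B) == f(A | B) + f(A & B) for all subsets A, B."""
    subsets = list(powerset(ground_set))
    for a in subsets:
        for b in subsets:
            if abs(f(a) + f(b) - f(a | b) - f(a & b)) > tol:
                return False
    return True


# Hypothetical per-gene costs; mixed signs make the function non-monotone.
weights = {"a": 1.0, "b": -2.0, "c": 0.5}


def additive_cost(subset):
    # A sum of independent per-gene terms is modular.
    return sum(weights[g] for g in subset)


def capped_size(subset):
    # Counter-example: a capped set size is submodular but not modular.
    return min(len(subset), 2)


print(is_modular(additive_cost, weights))  # additive cost
print(is_modular(capped_size, weights))    # capped size
```

The exhaustive check is exponential in the size of the ground set, so it is only useful as a sanity test on toy inputs.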
Since our relaxed version is NP-Hard, the original problem should also be hard to solve exactly. One interesting question is whether our greedy approach can provide any lower bound guarantee for the original problem. Another question is whether the cost function that incorporates duplication cost is still modular. If it is, devising appropriate randomized algorithms for the problem would be an interesting line of research. Besides, we can also redefine our problem so that the objective function minimizes not only the evolutionary cost between the reference gene block and the query gene block but also the total evolutionary cost between every pair of orthologous gene blocks from our group of species.
Lastly, I propose to run our methods on simulated data. The current data set in our article is taken from the previous paper by Ream et al. (2015). Although our greedy algorithm is about 100 times faster than the previous method, the previous approach itself takes only about 10^-5 seconds to run. Therefore, it would be interesting to devise simulation schemes that generate data sets showcasing scenarios in which one of the methods excels and the other suffers. Once we have several different data sets, we can better evaluate the different approaches.
7.2 Tracing the Ancestry of Operons in Bacteria
7.2.1 Discussions
• Problem. In Chapter 5, we presented the problem of reconstructing ancestral gene blocks in bacteria using the event-based method. Although ancestral reconstruction is a well-known problem, there had been no research on reconstructing gene blocks in bacteria.
• Approach. In our work, we presented maximum parsimony approaches to reconstruct the ancestral states of gene blocks. Under the parsimony assumption, we developed two methods: a local method and a global method. The local method optimizes the evolutionary changes between each node and its direct descendants, while the global method optimizes the total sum of evolutionary changes over the phylogenetic tree.
• Results. Using the two methods, we reconstructed the ancestral gene blocks of two groups of species: Gram-negative and Gram-positive bacteria. In Gram-negative bacteria, we used E. coli as our reference taxon and reconstructed the ancestral states of 55 operons from E. coli. In Gram-positive bacteria, we used B. subtilis as our reference taxon and reconstructed the ancestral states of 87 operons from B. subtilis. From the reconstructions, we observed several interesting trends in how gene blocks emerge together. We also noticed a couple of core gene clusters that dated back to the most recent common ancestors, as well as intermediate functional forms of orthoblocks. We found that the formation of a protein complex was the main force for gene block conservation. Regarding our two methods, we found that the global approach gave lower cumulative evolutionary changes. We also proved that our two methods guarantee the minimum deletion costs and duplication costs in their objective functions. Our ROAGUE software was the first to interrogate the evolution of gene blocks in bacteria using the event-based model. The tool provides a fast way to trace ancestral gene blocks and a clear visualization of the process. We hope ROAGUE can encourage more researchers to study the evolution of gene blocks in bacteria.
7.2.2 Future directions
Unlike the small parsimony problem (Fitch, 1971), our local ancestral reconstruction problem appears to be quite hard. However, we were not able to establish its hardness with a formal reduction from an NP-Hard problem. Hence, it would be very beneficial to provide an NP-Hardness proof for both the local and global approaches. Currently, our approaches only guarantee minimum deletion costs and duplication costs separately. An NP-Hardness reduction could direct us to a potential approximation or randomization scheme. Besides, our local approach employs a correction mechanism that prevents a gene from being greedily propagated into the tree more than it should be. A study of how this affects the lower bound of the total deletion costs could strengthen the case for the local approach, as it was at least twice as fast as the global approach.
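For contrast, the small parsimony problem for a single character is easy. The sketch below implements the bottom-up pass of Fitch's algorithm (Fitch, 1971) for the presence (1) or absence (0) of one gene on a rooted binary tree. It is not ROAGUE's local or global method; the nested-tuple tree encoding, the taxon names, and the states are hypothetical.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass for one character on a rooted binary tree.

    tree:        a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states: dict mapping each leaf name to its observed state.
    Returns (candidate_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical example: gene g is present in taxa A and B only.
example_tree = ((("A", "B"), "C"), ("D", "E"))
gene_g = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0}
root_states, changes = fitch(example_tree, gene_g)
print(root_states, changes)
```

Each gene is treated independently here, which is exactly what makes the single-character problem tractable and what an event-based cost over whole gene blocks does not allow.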
The ROAGUE software does not take horizontal gene transfer (HGT) into account and is agnostic to gene order. HGT and gene order are vital forces of gene block evolution in some species (Fang et al., 2008; Lawrence and Roth, 1996). Therefore, it would be a great advantage to incorporate them into our models. Regarding HGT, there are two main approaches: the parametric approach and the phylogenetic approach. Parametric approaches compute the GC content in a sliding window and compare it across a specified range of the genome sequence (Nguyen et al., 2015). Fragments that differ profoundly from the genomic signature are flagged as potential horizontally transferred regions. Phylogenetic approaches, on the other hand, identify inconsistencies between gene trees and the species tree. By cutting an edge (pruning) and regrafting it onto another edge, phylogenetic approaches reconcile the species tree with the gene trees and infer possible HGT nodes (Hallett and Lagergren, 2001).
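The parametric idea described above can be sketched in a few lines. This is a simplified illustration, assuming that the genomic signature is just the genome-wide GC fraction; the window size, step, and threshold are hypothetical parameters, and practical tools use richer signatures and statistical tests.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, end, gc) for windows whose GC fraction deviates from
    the genome-wide GC fraction by more than threshold."""
    genome_gc = gc_fraction(genome)
    flagged = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        chunk_gc = gc_fraction(chunk)
        if abs(chunk_gc - genome_gc) > threshold:
            flagged.append((start, start + len(chunk), chunk_gc))
    return flagged


# Toy sequence (hypothetical): an AT-rich background with a GC-rich insert.
toy_genome = "AT" * 50 + "GC" * 10 + "AT" * 50
hits = flag_atypical_windows(toy_genome, window=20, step=20, threshold=0.5)
print(hits)
```

On the toy sequence, only the window that coincides with the GC-rich insert is flagged.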
The second approach is more appropriate to incorporate into ROAGUE. For example, given a node u and its direct descendants u1 and u2, we might consider the scenario in which the presence of gene g in u1 or u2 is the result of an HGT event, instead of assuming that node u must have gene g. This can be done by adding an HGT event to our event-based model. Regarding gene order, one possible approach is to consider rearrangement events in a gene block. We can quantify the rearrangement cost of two blocks by finding the smallest number of steps needed to rearrange one block so that it has the same gene order as the other block. In my view, incorporating HGT and gene order will provide a more general perspective on the evolution of gene blocks in bacteria.
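One simple way to make the rearrangement cost concrete is sketched below, under two assumptions that are not part of the current model: a "step" is a swap of two adjacent genes, and both blocks contain the same genes with no duplicates. Under these assumptions the smallest number of steps equals the number of inversions between the two orders. The function name is hypothetical.

```python
def adjacent_swap_distance(block, target):
    """Minimum number of adjacent swaps that turn block into target.

    Assumes both blocks contain the same distinct genes.
    """
    if len(set(block)) != len(block) or sorted(block) != sorted(target):
        raise ValueError("blocks must contain the same distinct genes")
    # Position of every gene in the target order.
    rank = {gene: position for position, gene in enumerate(target)}
    order = [rank[gene] for gene in block]
    # Count inversions: pairs of genes that appear in the wrong relative order.
    inversions = 0
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if order[i] > order[j]:
                inversions += 1
    return inversions


# Example with single-letter gene labels: gene i sits right after gene a
# in one block and near the end in the other.
cost = adjacent_swap_distance("aibcdefghj", "abcdefghij")
print(cost)
```

Other definitions of a step, such as reversals of a segment, lead to different and generally harder distance problems.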
Last but not least, I propose to use a maximum likelihood (ML) model to reconstruct the ancestral gene blocks in bacteria. To use the ML model, it is necessary to compute the likelihood that one gene block evolves into another. Unlike nucleotide or amino acid substitution models, where all possible states are known, we cannot enumerate all possible gene blocks given the gene set in the reference operon or gene block. Therefore, we could use our three evolutionary events as the state transitions. Furthermore, I suggest using the normalized distance matrix (Ream et al., 2015) to compute the transition probability between each pair of states. I believe this model could provide a better understanding of the evolution of gene blocks because it takes branch lengths into account and can provide a likelihood estimate for our reconstruction.
7.3 Unraveling a Tangled Skein: Evolutionary Analysis of the Bacterial Gibberellin Biosynthesis Operon
7.3.1 Discussions
• Problem. In Chapter 6, we studied the evolution of the bacterial gibberellin biosynthetic operon. Although gibberellin phytohormones have been studied for more than a century, we know only a little about gibberellin biosynthesis in bacteria.
• Approach. We extended our ROAGUE method to study the evolution of the bacterial gibberellin operon. Our software implemented a pipeline that retrieves all contigs from the NCBI database given a list of species names and annotates them using Prokka (Seemann, 2014). In addition, ROAGUE enabled the detection of the pseudogene cyp115 and added an ad hoc procedure to identify the fusion of gene fdGA into gene sdrGA or gene cyp114. Combining this with the experimental study of the genes ids2 and ggps2, we employed our ROAGUE software to reconstruct the ancestral gene blocks of the GA operon within alphaproteobacteria rhizobia.
• Results. Using ROAGUE, we were able to characterize the loss, gain, and HGT of GA operon genes within alphaproteobacteria rhizobia. In addition to the species tree, we built an operon tree using the concatenated protein sequences of the core GA operon in order to capture HGT events. The software shed light on how the full-length gene cyp115 was lost and then regained independently in three different lineages, which further supports a potential HGT event. The full-length gene cyp115 is essential for GA biosynthesis in bacteria because it catalyzes the final step in producing bioactive GA4 (Nagel and Peters, 2017; Nagel et al., 2017; Nett et al., 2017a). The analyses using GC content and ROAGUE suggested that HGT occurred as the GA operon evolved in plant-associated bacteria. In addition, the intermediate forms of gibberellin (owing to the lack of gene cyp115) indicated that the production of GA might not be advantageous to all the proteobacteria rhizobia. These results showcase the power of ROAGUE in analyzing the evolution of gene blocks in bacteria.
7.3.2 Future directions
In the paper, we built the operon tree using a naive method, which concatenated the alignments of protein sequences of the core GA operon. In the future, we can apply Duplication-Transfer-Loss
(DTL) reconciliation to build on the operon tree. There are several implementations of this model that can be readily applied (Bansal et al., 2012; Kordi and Bansal, 2017). This can also help with the HGT analysis of the GA operon.
Regarding fusion detection, our current method provides only an ad hoc mechanism. One potential approach is to use GeneFuse to detect gene fusions before finding orthologous gene blocks (Chen et al., 2018). GeneFuse is a practical method that detects gene fusions with high sensitivity. The software also provides an interactive visualization of the results that shows the quality of the predictions, the coordinates of the fusion genes, etc. However, integrating this software into our pipeline might be a challenge. Alternatively, I propose to detect gene fusions while finding orthologous genes. This can be done by sorting orthologous candidate genes by their start codon positions and resolving overlapping genes. This approach can be readily employed in our pipeline without external software.
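A minimal sketch of the sort-and-resolve idea is given below. It assumes that each orthologous candidate is recorded as the reference gene it matches plus the start and end coordinates of the matched region on the target genome; the `Hit` record, the coordinates, and the rule that overlapping hits to different reference genes indicate a fusion candidate are all hypothetical simplifications, not the ad hoc procedure currently in ROAGUE.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    ref_gene: str  # reference gene that this candidate matches
    start: int     # start coordinate on the target genome
    end: int       # end coordinate on the target genome


def fusion_candidates(hits):
    """Sort hits by start position and report neighbouring hits that
    overlap on the genome but match different reference genes."""
    ordered = sorted(hits, key=lambda h: (h.start, h.end))
    pairs = []
    # Only neighbours in sorted order are compared, which keeps the sketch
    # simple but can miss overlaps between non-adjacent hits.
    for previous, current in zip(ordered, ordered[1:]):
        overlaps = current.start < previous.end
        if overlaps and previous.ref_gene != current.ref_gene:
            pairs.append((previous.ref_gene, current.ref_gene))
    return pairs


# Hypothetical hits: fd and sdr both map to the same region of the target.
example_hits = [
    Hit("cyp117", 2500, 3700),
    Hit("fd", 1400, 2400),
    Hit("sdr", 1400, 2400),
    Hit("cyp112", 100, 1300),
]
candidates = fusion_candidates(example_hits)
print(candidates)
```

Because the check runs on coordinates that the pipeline already collects when searching for orthologs, it would not require external software.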
Finally, I suggest applying the ROAGUE software to study the evolution of the GA operon in fungi and plants. Although our model was initially established to study gene block evolution in bacteria, it would be beneficial to apply it to other organisms. One of the challenges would be how to quantify the evolutionary changes in gene blocks in fungi and plants. If we could find appropriate models for those two groups of organisms, it would be possible to understand how GA operons evolve in a broader context.
BIBLIOGRAPHY
Acharya, R. (2009). Overexpression, purification, and characterization of MmgD from Bacillus subtilis strain 168. PhD thesis, University of North Carolina at Greensboro.
Ballouz, S., Francis, A. R., Lan, R., and Tanaka, M. M. (2010). Conditions for the evolution of gene clusters in bacterial genomes. PLoS Computational Biology, 6(2).
Bansal, M. S., Alm, E. J., and Kellis, M. (2012). Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics, 28(12):i283–i291.
Beckwith, J. R., Pardee, A. B., Austrian, R., and Jacob, F. (1962). Coordination of the synthesis of the enzymes in the pyrimidine pathway of e. coli. Journal of molecular biology, 5(6):618–634.
Belfaiza, J., Parsot, C., Martel, A., De La Tour, C. B., Margarita, D., Cohen, G. N., and Saint-Girons, I. (1986). Evolution in biosynthetic pathways: two enzymes catalyzing consecutive steps in methionine biosynthesis originate from a common ancestor and possess a similar regulatory region. Proceedings of the National Academy of Sciences, 83(4):867–871.
Chen, S., Liu, M., Huang, T., Liao, W., Xu, M., and Gu, J. (2018). Genefuse: detection and visualization of target gene fusions from dna sequencing data. International journal of biological sciences, 14(8):843.
Chernomor, O., Minh, B. Q., Forest, F., Klaere, S., Ingram, T., Henzinger, M., and von Haeseler, A. (2015). Split diversity in constrained conservation prioritization using integer linear programming. Methods in Ecology and Evolution, 6(1):83–91.
Chvatal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235.
Demerec, M. and Hartman, P. E. (1959). Complex loci in microorganisms. Annual Reviews in Microbiology, 13(1):377–406.
Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J. (2004). Genomic islands in pathogenic and environmental microorganisms. Nature Reviews Microbiology, 2(5):414–424.
Downey, R. G. and Fellows, M. R. (2012). Parameterized complexity. Springer Science & Business Media.
Eisen, J. A. (1998). A phylogenomic study of the muts family of proteins. Nucleic Acids Research, 26(18):4291–4300.
Enault, F., Suhre, K., Abergel, C., Poirot, O., and Claverie, J.-M. (2003). Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics, 19(suppl 1):i105–i107.
Fang, G., Rocha, E. P., and Danchin, A. (2008). Persistence drives gene clustering in bacterial genomes. BMC Genomics, 9(1):4.
Fani, R., Brilli, M., and Liò, P. (2005). The origin and evolution of operons: the piecewise building of the proteobacterial histidine operon. Journal of Molecular Evolution, 60(3):378–390.
Farris, J. S. (1977). Phylogenetic analysis under dollo’s law. Systematic Biology, 26(1):77–88.
Feige, U., Mirrokni, V. S., and Vondrák, J. (2011). Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153.
Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4):406–416.
Freiberg, C., Fellay, R., Bairoch, A., Broughton, W. J., Rosenthal, A., and Perret, X. (1997). Molecular basis of symbiosis between rhizobium and legumes. Nature, 387(6631):394–401.
González, V., Bustos, P., Ramírez-Romero, M. A., Medrano-Soto, A., Salgado, H., Hernández-González, I., Hernández-Celis, J. C., Quintero, V., Moreno-Hagelsieb, G., Girard, L., et al. (2003). The mosaic structure of the symbiotic plasmid of rhizobium etli cfn42 and its relation to other symbiotic genome compartments. Genome Biology, 4(6):R36.
Göttfert, M., Röthlisberger, S., Kündig, C., Beck, C., Marty, R., and Hennecke, H. (2001). Potential symbiosis-specific genes uncovered by sequencing a 410-kilobase dna region of the bradyrhizobium japonicum chromosome. Journal of Bacteriology, 183(4):1405–1412.
Grishin, A. M., Ajamian, E., Tao, L., Zhang, L., Menard, R., and Cygler, M. (2011). Structural and functional studies of the escherichia coli phenylacetyl-coa monooxygenase complex. Journal of Biological Chemistry, 286(12):10735–10743.
Hallett, M. T. and Lagergren, J. (2001). Efficient algorithms for lateral gene transfer problems. In Proceedings of The fifth annual international conference on Computational biology, pages 149–156.
Harms, M. J. and Thornton, J. W. (2010). Analyzing protein structure and function using ancestral gene reconstruction. Current Opinion in Structural Biology, 20(3):360–366.
Hayashi, S., Gresshoff, P. M., and Ferguson, B. J. (2014). Mechanistic action of gibberellins in legume nodulation. Journal of Integrative Plant Biology, 56(10):971–978.
Hedden, P. and Sponsel, V. (2015). A century of gibberellin research. Journal of Plant Growth Regulation, 34(4):740–760.
Hershey, D. M., Lu, X., Zi, J., and Peters, R. J. (2014). Functional conservation of the capacity for ent-kaurene biosynthesis and an associated operon in certain rhizobia. Journal of Bacteriology, 196(1):100–106.
Hiroe, A., Tsuge, K., Nomura, C. T., Itaya, M., and Tsuge, T. (2012). Rearrangement of gene order in the phaCAB operon leads to effective production of ultrahigh-molecular-weight poly[(R)-3-hydroxybutyrate] in genetically engineered Escherichia coli. Applied and Environmental Microbiology, 78(9):3177–3184.
Hoare, C. A. R. (1961). Algorithm 64: quicksort. Communications of the ACM, 4(7):321.
Homuth, G., Masuda, S., Mogk, A., Kobayashi, Y., and Schumann, W. (1997). The dnaK operon of Bacillus subtilis is heptacistronic. Journal of Bacteriology, 179(4):1153–1164.
Horowitz, N. H. (1945). On the evolution of biochemical syntheses. Proceedings of the National Academy of Sciences of the United States of America, 31(6):153.
Hou, X., Lee, L. Y. C., Xia, K., Yan, Y., and Yu, H. (2010). DELLAs modulate jasmonate signaling via competitive binding to JAZs. Developmental Cell, 19(6):884–894.
Ismail, W., El-Said Mohamed, M., Wanner, B. L., Datsenko, K. A., Eisenreich, W., Rohdich, F., Bacher, A., and Fuchs, G. (2003). Functional genomics by NMR spectroscopy. European Journal of Biochemistry, 270(14):3047–3054.
Jacob, F. and Monod, J. (1961). On the regulation of gene activity. In Cold Spring Harbor Symposia on Quantitative Biology, volume 26, pages 193–211, Cold Spring Harbor, NY. Cold Spring Harbor Laboratory Press.
Kaneko, T., Nakamura, Y., Sato, S., Minamisawa, K., Uchiumi, T., Sasamoto, S., Watanabe, A., Idesawa, K., Iriguchi, M., Kawashima, K., et al. (2002). Complete genomic sequence of nitrogen-fixing symbiotic bacterium Bradyrhizobium japonicum USDA110. DNA Research, 9(6):189–197.
Karger, D. R. (1999). Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383–413.
Kasimoglu, E., Park, S.-J., Malek, J., Tseng, C. P., and Gunsalus, R. P. (1996). Transcriptional regulation of the proton-translocating ATPase (atpIBEFHAGDC) operon of Escherichia coli: control by cell growth rate. Journal of Bacteriology, 178(19):5563–5567.
Koonin, E. V. (2009). Evolution of genome architecture. The International Journal of Biochemistry & Cell Biology, 41(2):298–306.
Koonin, E. V., Makarova, K. S., and Aravind, L. (2001). Horizontal gene transfer in prokaryotes: quantification and classification. Annual Review of Microbiology, 55(1):709–742.
Kordi, M. and Bansal, M. S. (2017). Exact algorithms for duplication-transfer-loss reconciliation with non-binary gene trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Koumandou, V. L. and Kossida, S. (2014). Evolution of the F0F1 ATP synthase complex in light of the patchy distribution of different bioenergetic pathways across prokaryotes. PLoS Computational Biology, 10(9):e1003821.
Lawrence, J. G. and Roth, J. R. (1996). Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics, 143(4):1843–1860.
Levy, A., Gonzalez, I. S., Mittelviefhaus, M., Clingenpeel, S., Paredes, S. H., Miao, J., Wang, K., Devescovi, G., Stillman, K., Monteiro, F., et al. (2018). Genomic features of bacterial adaptation to plants. Nature Genetics, 50(1):138–150.
Lim, H. N., Lee, Y., and Hussein, R. (2011). Fundamental relationship between operon organization and gene expression. Proceedings of the National Academy of Sciences, 108(26):10626–10631.
Lu, X., Hershey, D. M., Wang, L., Bogdanove, A. J., and Peters, R. J. (2015). An ent-kaurene-derived diterpenoid virulence factor from Xanthomonas oryzae pv. oryzicola. New Phytologist, 206(1):295–302.
Luc, N., Risler, J.-L., Bergeron, A., and Raffinot, M. (2003). Gene teams: a new formalization of gene clusters for comparative genomics. Computational Biology and Chemistry, 27(1):59–67.
Luengo, J. M., Garcia, J. L., and Olivera, E. R. (2001). The phenylacetyl-CoA catabolon: a complex catabolic unit with broad biotechnological applications. Molecular Microbiology, 39(6):1434–1442.
Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O., and Eisenberg, D. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753.
Martin, F. J. and McInerney, J. O. (2009). Recurring cluster and operon assembly for phenylacetate degradation genes. BMC Evolutionary Biology, 9(1):36.
McAdam, E. L., Reid, J. B., and Foo, E. (2018). Gibberellins promote nodule organogenesis but inhibit the infection stages of nodulation. Journal of Experimental Botany, 69(8):2117–2130.
Méndez, C., Baginsky, C., Hedden, P., Gong, F., Carú, M., and Rojas, M. C. (2014). Gibberellin oxidase activities in Bradyrhizobium japonicum bacteroids. Phytochemistry, 98:101–109.
Nagel, R., Bieber, J. E., Schmidt-Dannert, M. G., Nett, R. S., and Peters, R. J. (2018). A third class: Functional gibberellin biosynthetic operon in beta-proteobacteria. Frontiers in Microbiology, 9:2916.
Nagel, R. and Peters, R. J. (2017). Investigating the phylogenetic range of gibberellin biosynthesis in bacteria. Molecular Plant-Microbe Interactions, 30(4):343–349.
Nagel, R., Turrini, P. C., Nett, R. S., Leach, J. E., Verdier, V., Van Sluys, M.-A., and Peters, R. J. (2017). An operon for production of bioactive gibberellin A4 phytohormone with wide distribution in the bacterial rice leaf streak pathogen Xanthomonas oryzae pv. oryzicola. New Phytologist, 214(3):1260–1266.
Navarro, L., Bari, R., Achard, P., Lisón, P., Nemri, A., Harberd, N. P., and Jones, J. D. (2008). DELLAs control plant immune responses by modulating the balance of jasmonic acid and salicylic acid signaling. Current Biology, 18(9):650–655.
Nett, R. S., Contreras, T., and Peters, R. J. (2017a). Characterization of CYP115 as a gibberellin 3-oxidase indicates that certain rhizobia can produce bioactive gibberellin A4. ACS Chemical Biology, 12(4):912–917.
Nett, R. S., Montanares, M., Marcassa, A., Lu, X., Nagel, R., Charles, T. C., Hedden, P., Rojas, M. C., and Peters, R. J. (2017b). Elucidation of gibberellin biosynthesis in bacteria reveals convergent evolution. Nature Chemical Biology, 13(1):69.
Nguyen, H. N., Jain, A., Eulenstein, O., and Friedberg, I. (2019). Tracing the ancestry of operons in bacteria. Bioinformatics, 35(17):2998–3004.
Nguyen, M., Ekstrom, A., Li, X., and Yin, Y. (2015). HGT-Finder: A new tool for horizontal gene transfer finding and application to Aspergillus genomes. Toxins, 7(10):4035–4053.
NJ, G. (1984). Construction and characterization of an Escherichia coli strain with a uncI mutation. Journal of Bacteriology, 158:820–825.
Nogales, J., Macchi, R., Franchi, F., Barzaghi, D., Fernandez, C., Garcia, J. L., Bertoni, G., and Diaz, E. (2007). Characterization of the last step of the aerobic phenylacetic acid degradation pathway. Microbiology, 153(2):357–365.
Ogata, H., Fujibuchi, W., Goto, S., and Kanehisa, M. (2000). A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Research, 28(20):4021–4028.
Oldroyd, G. E., Murray, J. D., Poole, P. S., and Downie, J. A. (2011). The rules of engagement in the legume-rhizobial symbiosis. Annual Review of Genetics, 45:119–144.
Omelchenko, M., Makarova, K., Wolf, Y., Rogozin, I., and Koonin, E. (2003). Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biology, 4(9):R55.
Overbeek, R., Fonstein, M., D’souza, M., Pusch, G. D., and Maltsev, N. (1999). The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences of the United States of America, 96(6):2896–2901.
Pardee, A. B., Jacob, F., and Monod, J. (1959). The genetic control and cytoplasmic expression of “inducibility” in the synthesis of β-galactosidase by E. coli. Journal of Molecular Biology, 1(2):165–178.
Pauling, L., Zuckerkandl, E., and Henriksen, T. (1963). Chemical paleogenetics. Acta Chemica Scandinavica, 17:S9–S16.
Pueppke, S. G. and Broughton, W. J. (1999). Rhizobium sp. strain NGR234 and R. fredii USDA257 share exceptionally broad, nested host ranges. Molecular Plant-Microbe Interactions, 12(4):293–318.
Pupko, T., Pe’er, I., Shamir, R., and Graur, D. (2000). A fast algorithm for joint reconstruction of ancestral amino acid sequences. Molecular Biology and Evolution, 17(6):890–896.
Quattlebaum, A. L. (2009). Characterization of Biosynthetic and Catabolic Pathways of Bacillus Subtilis Strain 168. PhD thesis, University of North Carolina at Greensboro.
Ream, D. C., Bankapur, A. R., and Friedberg, I. (2015). An event-driven approach for studying gene block evolution in bacteria. Bioinformatics, 31(13):2075–2083.
Schulz, T., Stoye, J., and Doerr, D. (2018). GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data. BMC Genomics, 19(5):308.
Schwechheimer, C. (2012). Gibberellin signaling in plants–the extended version. Frontiers in Plant Science, 2:107.
Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068–2069.
Senior, A. E. (1990). The proton-translocating ATPase of Escherichia coli. Annual Review of Biophysics and Biophysical Chemistry, 19(1):7–41. PMID: 2141983.
Srinivasan, B. S., Caberoy, N. B., Suen, G., Taylor, R. G., Shah, R., Tengra, F., Goldman, B. S., Garza, A. G., and Welch, R. D. (2005). Functional genome annotation through phylogenomic mapping. Nature Biotechnology, 23(6):691–698.
Stahl, F. W. and Murray, N. E. (1966). The evolution of gene clusters and genetic circularity in microorganisms. Genetics, 53(3):569.
Sullivan, J. T., Trzebiatowski, J. R., Cruickshank, R. W., Gouzy, J., Brown, S. D., Elliot, R. M., Fleetwood, D. J., McCallum, N. G., Rossbach, U., Stuart, G. S., et al. (2002). Comparative sequence analysis of the symbiosis island of Mesorhizobium loti strain R7A. Journal of Bacteriology, 184(11):3086–3095.
Swofford, D. L. and Maddison, W. P. (1987). Reconstructing ancestral character states under Wagner parsimony. Mathematical Biosciences, 87(2):199–229.
Syvanen, M. (1994). Horizontal gene transfer: evidence and possible consequences. Annual Review of Genetics, 28(1):237–261.
Tatsukami, Y. and Ueda, M. (2016). Rhizobial gibberellin negatively regulates host nodule number. Scientific Reports, 6:27998.
Teufel, R., Mascaraque, V., Ismail, W., Voss, M., Perera, J., Eisenreich, W., Haehnel, W., and Fuchs, G. (2010). Bacterial phenylalanine and phenylacetate catabolic pathway revealed. Proceedings of the National Academy of Sciences, 107(32):14390–14395.
Tully, R., van Berkum, P., Lovins, K., and Keister, D. (1998). Identification and sequencing of a cytochrome P450 gene cluster from Bradyrhizobium japonicum. Biochimica et Biophysica Acta (BBA)-Gene Structure and Expression, 1398(3):243–255.
Uchiumi, T., Ohwada, T., Itakura, M., Mitsui, H., Nukui, N., Dawadi, P., Kaneko, T., Tabata, S., Yokoyama, T., Tejima, K., et al. (2004). Expression islands clustered on the symbiosis island of the Mesorhizobium loti genome. Journal of Bacteriology, 186(8):2439–2448.
Ueguchi-Tanaka, M., Ashikari, M., Nakajima, M., Itoh, H., Katoh, E., Kobayashi, M., Chow, T.-y., Yue-ie, C. H., Kitano, H., Yamaguchi, I., et al. (2005). Gibberellin insensitive dwarf1 encodes a soluble receptor for gibberellin. Nature, 437(7059):693–698.
Voigt, B., Hoi, L. T., Jürgen, B., Albrecht, D., Ehrenreich, A., Veith, B., Evers, S., Maurer, K.-H., Hecker, M., and Schweder, T. (2007). The glucose and nitrogen starvation response of Bacillus licheniformis. Proteomics, 7(3):413–423.
Wells, J. N., Bergendahl, L. T., and Marsh, J. A. (2016). Operon gene order is optimized for ordered protein complex assembly. Cell Reports, 14(4):679–685.
Wetzstein, M., Völker, U., Dedio, J., Löbau, S., Zuber, U., Schiesswohl, M., Herget, C., Hecker, M., and Schumann, W. (1992). Cloning, sequencing, and molecular analysis of the dnaK locus from Bacillus subtilis. Journal of Bacteriology, 174(10):3300–3310.
Wiemann, P., Sieber, C. M., Von Bargen, K. W., Studt, L., Niehaus, E.-M., Espino, J. J., Huss, K., Michielse, C. B., Albermann, S., Wagner, D., et al. (2013). Deciphering the cryptic genome: genome-wide analyses of the rice pathogen Fusarium fujikuroi reveal complex regulation of secondary metabolism and novel metabolites. PLoS Pathogens, 9(6).
Williams, P. D., Pollock, D. D., Blackburne, B. P., and Goldstein, R. A. (2006). Assessing the accuracy of ancestral protein reconstruction methods. PLoS Computational Biology, 2(6).
Wolf, Y. I., Rogozin, I. B., Kondrashov, A. S., and Koonin, E. V. (2001). Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Research, 11(3):356–372.
Yamaguchi, S. (2008). Gibberellin metabolism and its regulation. Annual Review of Plant Physiology, 59:225–251.
Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8):1586–1591.
Yang, Z., Kumar, S., and Nei, M. (1995). A new method of inference of ancestral nucleotide and amino acid sequences. Genetics, 141(4):1641–1650.
Zi, J., Mafu, S., and Peters, R. J. (2014). To gibberellins and beyond! Surveying the evolution of (di)terpenoid metabolism. Annual Review of Plant Biology, 65:259–286.
Zuckerkandl, E. and Pauling, L. (1965). Evolutionary divergence and convergence in proteins. In Evolving genes and proteins, pages 97–166. Elsevier, New York, NY.