Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations

2020

Evolution of gene blocks in bacteria using an event-based model

Huy Nguyen Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation Nguyen, Huy, "Evolution of gene blocks in bacteria using an event-based model" (2020). Graduate Theses and Dissertations. 18368. https://lib.dr.iastate.edu/etd/18368

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Evolution of gene blocks in bacteria using an event-based model

by

Huy Nguyen

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:
Oliver Eulenstein, Co-major Professor
Iddo Friedberg, Co-major Professor
David Fernandez-Baca
Dennis V. Lavrov
Xiaoqiu Huang

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2020

Copyright © Huy Nguyen, 2020. All rights reserved.

DEDICATION

To my parents, Danh Nguyen and Hoa Tran, who have encouraged, inspired, and enabled me to pursue my educational studies, and my family and friends for their support throughout my studies at home and abroad.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGMENTS

ABSTRACT

CHAPTER 1. INTRODUCTION

CHAPTER 2. RELATED WORKS
    2.1 Gene Block Evolution Models
    2.2 Finding Orthologous Gene Blocks in Bacteria
    2.3 Ancestral Reconstruction of Gene Blocks in Bacteria

CHAPTER 3. PRELIMINARIES
    3.1 Event-based model
    3.2 Phylogenetic tree

CHAPTER 4. FINDING ORTHOLOGOUS GENE BLOCKS IN BACTERIA: THE COMPUTATIONAL HARDNESS OF THE PROBLEM AND NOVEL METHODS TO ADDRESS IT
    4.1 The Relevant-Gene-Block problem
    4.2 The RGB problem is Hard
    4.3 Solving the RGB problem
        4.3.1 Greedy Algorithm
        4.3.2 Integer Linear Programming
    4.4 Results and Discussion
        4.4.1 Scalability Study
        4.4.2 Accuracy Study
        4.4.3 Concluding Discussion
    4.5 Conclusion

CHAPTER 5. TRACING THE ANCESTRY OF OPERONS IN BACTERIA
    5.1 The Ancestral-Reconstruction-Gene-Block problem
    5.2 Solving the ARGB problems
        5.2.1 Local Algorithm
        5.2.2 Global Algorithm
    5.3 Results and Discussion
        5.3.1 Operons using Escherichia coli as a reference taxon
        5.3.2 Operons using Bacillus subtilis as a reference taxon
    5.4 Conclusions

CHAPTER 6. UNRAVELING A TANGLED SKEIN: EVOLUTIONARY ANALYSIS OF THE BACTERIAL GIBBERELLIN BIOSYNTHETIC OPERON
    6.1 Gene clusters analysis
    6.2 Results and Discussion
    6.3 Appendix: Supplementary Materials

CHAPTER 7. DISCUSSION AND FUTURE DIRECTIONS
    7.1 Finding Orthologous Gene Blocks in Bacteria: the Computational Hardness of the Problem and Novel Methods to Address It
        7.1.1 Discussions
        7.1.2 Future directions
    7.2 Tracing the Ancestry of Operons in Bacteria
        7.2.1 Discussions
        7.2.2 Future directions
    7.3 Unraveling a Tangled Skein: Evolutionary Analysis of the Bacterial Gibberellin Biosynthesis Operon
        7.3.1 Discussions
        7.3.2 Future directions

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Pairwise events for the orthoblocks shown A-E

Table 6.1 Pairwise alignments of GGPS2 proteins and representative GGPS proteins. Protein sequences were aligned using the MUSCLE algorithm (Geneious Prime; default settings). The GGPS proteins included in this analysis were selected due to the relatively close phylogenetic relationship between the core operon of this strain and that of the ggps2-containing strains, which lack a full-length ggps gene (i.e. Rhizobium etli bv. mimosae str. Mim1 has the most similar core operon to the ggps2-containing Rhizobium strains and Bradyrhizobium sp. WSM1417 has the most similar core operon to ggps2-containing Bradyrhizobium strains). Sequences with greater than 90% amino acid identity are highlighted in dark blue, while those with greater than 80% but less than 90% amino acid identity are highlighted in light blue.

Table 6.2 List of strains used within the ancestral reconstruction analyses. Shown for each strain is the corresponding GenBank genome accession, the legume host from which it was isolated, and the nodule type (D = determinate, I = indeterminate, * = neither). Note that the legume host and nodule type are irrelevant for the gammaproteobacteria plant pathogens analyzed in these reconstructions.

Table 6.3 Strains containing a CYP114-Fd_GA gene fusion.

Table 6.4 Strains containing a Fd_GA-SDR_GA gene fusion.

LIST OF FIGURES

Figure 2.1 Species C has an experimentally determined operon (black arrows) and serves as the reference taxon. The orthologs in species A, B, D, and E were determined as explained in the text. The events between C and all other species for this orthoblock are A–C: deletion (of gene c); B–C: split (of gene c); C–D: duplication (of b) and split (jagged line); C–E: duplication (of b).

Figure 2.2 Given a tree rooted at node x whose leaf nodes are labeled with nucleotides, the reconstruction problem finds a nucleotide assignment for each inner node. Figure 2.2b is an example of such an assignment.

Figure 4.1 The x-axis shows the operon/gene block names, sorted in ascending order by the number of genes in each block. From left to right: operon ast to operon ygb have 5 genes in the gene block, operon cai to operon ydh have 6 genes, operon cas to operon tdc have 7 genes, operon bam to operon ydg have 8 genes, operon atp to operon yje have 9 genes, operon paa to operon yrb have 11 genes, operon hyf has 12 genes, and operon nuo has 13 genes. The y-axis is the running time of the method in seconds on a log10 scale.

Figure 4.2 The x-axis is the same as in Figure 4.1. The y-axis is the event count for the orthologous gene block in each species compared to the reference gene block. In Figures 4.2a, 4.2b, and 4.2c, a dot represents the deletion events, the split events, and the sum of deletion and split events, respectively. Red dots represent the Heuristic method. Blue dots represent the Greedy method. Green dots represent the ILP method.

Figure 5.1 Ancestral reconstruction of operon atpIBEFHAGDC using the global optimization approach. The lower-case letters in each tree node represent the genes in the orthoblock (e.g. “a” represents “atpA”, see legend in blue bar, top). A ‘|’ designates a split (i.e. a distance ≥500bp between the genes to either side of the ‘|’). The green bar on top shows the total number of events that took place in this reconstruction. The numbers in the brackets in the inner nodes are a 3-tuple showing the cumulative count of events going from the leaf nodes to the tree root in the following order: [deletions, duplications, splits]. No orthologous gene blocks were found in species labeled with an asterisk (∗). The reference genome E. coli is marked with a box. These naming and color conventions persist through this study.

Figure 5.2 Ancestral gene block reconstruction of operon paaABCDEFGHIJK using the global reconstruction approach.

Figure 5.3 Ancestral reconstruction of lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO.

Figure 5.4 Ancestral reconstruction of mmgABCDE-prpB.

Figure 6.1 Diversity among GA biosynthetic operons in divergent bacterial lineages. The core operon genes are defined as cyp112, cyp114, fd, sdr, cyp117, ggps, cps, and ks, as these are almost always present within the GA operon. Other genes, including cyp115, idi, and ids2, exhibit a more limited distribution among GA operon-containing species. The tandem diagonal lines in the Mesorhizobium loti MAFF303099 operon indicate that cyp115 is not located adjacent to the rest of the operon.

Figure 6.2 Pipeline of the method to reconstruct the ancestral state of the gibberellin operon.

Figure 6.3 Reduced ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method as a measure of the distance between species. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

Figure 6.4 Reduced ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

Figure 6.5 Summary of ancestral reconstruction for the GA biosynthetic operon. As a representation of GA operon evolution, the results from the full reconstruction generated using the concatenated operon (FO) are summarized here. The initial loss of the cyp115 gene is indicated with a red arrow, while the reacquisition of this gene is indicated with a green circle at the ancestral node. The loss of ggps and acquisition of ggps2 is indicated by a blue box at the ancestral node. Brackets around a gene represent a variable presence within that lineage. Double slanted lines indicate genes that are not located within the cluster (i.e. >500 bp away). The family of proteobacterial lineages is indicated to the right of the figure (α and γ labels).

Figure 6.6 Ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two γ-proteobacteria using the Neighbor-Joining method as a measure of the distance between species. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

Figure 6.7 Ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two γ-proteobacteria using the Neighbor-Joining method. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, Dr. Oliver Eulenstein. He has been helpful in guiding me on this long journey. I would also like to thank my co-advisor, Dr. Iddo Friedberg, for his continuous support and guidance throughout my graduate program of study. My journey has been significantly impacted by Dr. Friedberg. Many thanks to my committee members, Drs. David Fernandez-Baca, Dennis Lavrov, and Xiaoqiu Huang, for their great suggestions toward my scientific research. I am especially grateful to my brother, Duy Nguyen, for his encouragement and support throughout my program of study. Last but not least, I sincerely appreciate my girlfriend, Nhi Van, who has encouraged me and listened to me when I struggled with difficulties and rejoiced with me at my successes.

ABSTRACT

A complex system is a group or organization in which many components interact to perform functions that a single component cannot. The study of the evolution of complex systems can reveal the pattern of interactions, the dependencies between entities, the impacts on the system’s environment, etc. In bacterial genomes, a gene block can be thought of as a complex system. Gene blocks are genes that are co-located on chromosomes, whose evolutionary conservation is essential and useful in phylogenetic and functional studies. Conservation of gene blocks across several species indicates that some of the gene blocks are operons, which are gene blocks that are transcribed together into an mRNA under the control of a single promoter. Previously, an event-based model was introduced to study the evolution of gene blocks in bacteria. It was shown to be practical and efficient in examining the formation of gene blocks. In this dissertation research, I present three studies on the evolution of gene blocks in bacteria using the event-based model.

In the first study, we introduced the formalization of the event-based model and used it to study the problem of finding orthologous gene blocks in bacteria. Unfortunately, the problem is NP-hard, which means it is unlikely that an efficient algorithm exists to solve it exactly. Additionally, it is hard to approximate the problem within a ratio of log_2(n)/2 ≈ 0.72 ln n, where n is the number of genes in the reference gene block. Fortunately, it is possible to approximate it within a ratio of ln n using a variant of the greedy algorithm for the Set Cover problem. To evaluate the optimality of this method, we formulated the problem as an Integer Linear Program. Using a data set from my previous paper, we compared my method with the state-of-the-art method and showed that mine was more efficient and produced results closer to optimal.

In the second study, we explored the evolution of the homologous gene blocks under the event-based model. Using the maximum parsimony approach, we developed two heuristic algorithms to reconstruct the ancestral state of gene blocks of a group of taxa given a reference gene block. We applied these two methods to study the evolution of 55 operons in Escherichia coli and 83 operons in Bacillus subtilis. From the results, we discovered several plausible intermediate states of gene blocks which might be fully or partially functional. From the reconstruction of the ancestral states of gene blocks, we formed hypotheses on how the gene blocks evolved and what external forces could have affected the course of their evolution.

In the third study, we extended the heuristic algorithm from the second study to trace the evolution of the bacterial gibberellin (GA) biosynthetic operon. Gibberellin hormones are ubiquitous regulators of growth and developmental processes in vascular plants. Although homologous operons encode GA biosynthetic enzymes in both rhizobia and phytopathogens, there is a clear functional distinction between them. Using the reconstruction algorithm, we characterized losses, gains, and duplications of the gibberellin operon genes within rhizobia and revealed a complex history of horizontal gene transfer of individual genes and clusters of genes.

In conclusion, the event-based model provides a powerful tool to study the evolution of gene blocks in bacteria. Although the problems are NP-hard, heuristic approaches can provide meaningful results within a reasonable time. In the future work section, I present directions to extend our studies, as well as problems they might solve.

CHAPTER 1. INTRODUCTION

Bacterial genes that encode proteins in a common pathway may cluster together. Those genes may sometimes be found in the order of their biochemical reactions (Demerec and Hartman, 1959). Sometimes, blocks of genes (or gene blocks) are co-transcribed as an operon, sharing the same promoter and regulator, and are, in this case, often associated with a single function. It is estimated that 55% of the genes of E. coli reside within an operon (Wolf et al., 2001). Operons help cells to express or repress a group of genes efficiently. An essential question in evolutionary studies is how bacterial gene blocks and operons evolve. Several models have been proposed to explain the formation of these structures, such as the natal model (Horowitz, 1945), the selfish operon model (Lawrence and Roth, 1996), the persistent model (Fang et al., 2008), etc. However, each of these models supports only a specific line of biological reasoning and cannot be readily applied to every chosen gene block and set of taxa. Previously, we proposed the event-based model as a universal model to explain gene block and operon evolution. This model enables us to compute the evolutionary changes among gene blocks from different taxa (Ream et al., 2015) and quantify the conservation of different gene blocks in bacteria.

In Chapter 4, we define the mathematical problem of finding orthologous gene blocks in bacteria using the event-based model. We show that the problem is NP-hard by a reduction from the Minimum-Set-Cover problem. In addition, we prove that it is hard to approximate within a constant factor of the optimal. Later, we introduce a heuristic approach that gives a ln n approximation and runs in polynomial time. To evaluate this approach and the approach in (Ream et al., 2015), we define an Integer Linear Programming version of the problem and solve it exactly on small datasets. Then, we compare the accuracy and efficiency of the methods. Overall, our approach runs about 100 times faster than the previous approach and 10,000 times faster than the Integer Linear Programming approach. The results of this method are closer to optimal than those of the previous method. The software accompanying our method is available under the GPLv3 license on GitHub: Relevant Operon
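For orientation, the textbook greedy heuristic for Set Cover, which carries a ln n approximation guarantee, can be sketched as follows. This is the classical algorithm only, not the variant developed in Chapter 4; the function name `greedy_set_cover` and the toy input are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(universe: Set[Hashable],
                     subsets: Dict[str, Set[Hashable]]) -> List[str]:
    """Repeatedly pick the subset that covers the most uncovered elements."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Subset with the largest gain on the still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy instance (hypothetical): elements stand in for genes of a reference block.
universe = set("abcdefg")
subsets = {"s1": set("abc"), "s2": set("cdef"), "s3": set("fg"), "s4": set("a")}
cover = greedy_set_cover(universe, subsets)
print(cover)
```

On this toy input the heuristic picks `s2` first (four new elements), then `s1`, then `s3`.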

In Chapter 5, we apply the idea of ancestral sequence reconstruction to trace the ancestral gene block in bacteria using the event-based model. We define a cost function for the gene block assignment of a given tree. We propose a global and a local approach that minimize the given cost function. Using these approaches, we reconstruct the ancestral states of gene blocks from E. coli and B. subtilis for a given group of species. From the results, we can identify the intermediate functional forms of the gene blocks and study the pattern of conservation. We conclude that essentiality is one of the main reasons for gene block conservation. The software accompanying our method is available under the GPLv3 license on GitHub: ROAGUE

The plant hormones gibberellins (GAs) are essential to plant growth and development. They regulate stem elongation, seed germination, flowering, etc. (Yamaguchi, 2008). Research on gibberellins started in 1926, when a Japanese plant pathologist discovered that the fungus Gibberella fujikuroi produced a substance that affected rice stems. It was not until the 1950s that more researchers shifted their focus to gibberellins in plants. Recent studies have revealed that among the more than 100 GAs identified, only 4 are bioactive: GA1, GA3, GA4, and GA7. Hence, the other GAs exist as precursors of the bioactive forms or as deactivated metabolites (Yamaguchi, 2008). In the 20th century, the plant and fungal GA biosynthesis pathways were fully characterized. However, GA biosynthesis in bacteria was characterized only recently. Through the characterization of the putative GA gene cluster in Bradyrhizobium japonicum, the bacterial biosynthesis pathway of GA was elucidated, and the cluster was shown to encode the enzymes that produce GA9 (Nett et al., 2017b). Analysis of this pathway reveals that GA biosynthesis in bacteria has evolved independently of plants and fungi; however, some essential functionalities of the GA operon are still conserved in bacteria through the encoding genes.

In Chapter 6, we uncover reasons why bacteria gained the GA operon and its role in plant-microbe interactions. We applied my ancestral reconstruction method to study the evolution of the gibberellin operon in a group of 120 alphaproteobacteria rhizobia. From the analysis, we traced the loss and gain of several important genes of the operon and found horizontal gene transfer of whole clusters among the species.

CHAPTER 2. RELATED WORKS

2.1 Gene Block Evolution Models

The earliest study of pathway evolution, which gave rise to the natal model, suggested that the first living entity was a completely heterotrophic unit, which meant that it could only reproduce using the metabolites in its environment (Horowitz, 1945). As nutrients in the primitive environment became limited, mutations arose that allowed the entity to utilize other available substances. This process eventually led to the arrangement of genes in blocks. However, later studies showed that many bacterial operons are composed of genes that lack homology (within an operon) (Lawrence and Roth, 1996). Furthermore, the proteins metB and metC are encoded by unlinked genes despite being highly homologous to each other (similar sequences) and catalyzing successive steps in methionine biosynthesis (Belfaiza et al., 1986). Besides, the natal model implied that gene positions are not important and that clustering brings no direct benefit to the host. This was later challenged by the persistent model (Fang et al., 2008).

Another hypothesis, called the co-adaptation model, applied Fisher’s theory of gene clusters to explain the increased linkage among co-adapted genes (genes that function together stay together in the genome). Fisher’s theory explains the natural selection process in eukaryotes that favors clustering of genes related to each other; hence, it reduces the frequency of recombination events (e.g. crossing-over during meiosis). This theory was later extended to prokaryotes. Experiments on coliphages (double-stranded DNA bacteriophages) demonstrated that gene clustering in microorganisms was the result of natural selection against recombination (Stahl and Murray, 1966). However, the Fisher model requires sufficient recombination to select for gene clusters. Such a condition is typical for eukaryotes, but not so much for bacteria, which reproduce asexually.

The co-regulation model suggested that the clustering of genes was facilitated by co-regulation. If the genes cluster, a single operator can induce and repress them simultaneously (Jacob and Monod, 1961; Pardee et al., 1959). Co-regulation implies an enhancement in fitness for individuals that have gene clusters because the genes can be expressed more economically. However, some genes are co-regulated but not clustered, such as the arg and pyr genes in E. coli (Beckwith et al., 1962). Hence, co-regulation does not imply clustering.

The selfish operon model proposed that the formation of gene clusters in bacteria was the result of the transfer of DNA within and among taxa (Lawrence and Roth, 1996). This model was quite influential and also controversial. The physical proximity of genes improves the probability of successful horizontal transfer, which can supplement the missing genes of a given cell due to mutations or genetic drift. This gives rise to the property of being "selfish", which means the cluster of genes is beneficial only for gene transfer between species and not for the host. This hypothesis was supported in bacteria where horizontal gene transfer was mediated by conjugative plasmids, transducing bacteriophages, etc. (Syvanen, 1994). However, clusters that tend to lose genes usually do not code for essential functions. Therefore, this model applied well to genes encoding accessory functions, and essential genes were predicted not to be clustered. However, the rpl and atp operons, which encode ribosomal proteins and F1Fo-ATPase, respectively, are both arranged closely in blocks. Besides, some genes are clustered in some species but not in others. For example, pur genes are grouped in B. subtilis (11 genes) but are scattered in S. typhimurium. Those cases are inconsistent with the selfish operon model. Furthermore, this model could not explain the arrangement in gene clusters that follows the order of biochemical reactions.

The persistent model explained the persistence of gene blocks once they were formed. Under this model, clustering genes were grouped into persistent genes and rare genes. The model only focused on the clustering of highly persistent genes in bacteria (Fang et al., 2008). The model defined the Persistent Index (PI) of a gene as the percentage of bacteria containing that gene, and used it to look for an association between the tendency of genes to cluster and their PI. Interestingly, genes with PI ≥ 65% significantly clustered together. In addition, the model suggested that co-transcription and functional coupling helped to maintain the linkage between those persistent genes. Hence, the cluster was less likely to be disrupted by indel events and could resist mutation more easily.
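The Persistent Index described above is a simple proportion, and a minimal sketch of its computation may help. The function name `persistent_index`, the gene and taxon names, and the presence data below are all hypothetical toy values; only the 65% threshold comes from the text.

```python
from typing import Dict, Set

PI_THRESHOLD = 65.0  # percent, the cutoff reported in the text


def persistent_index(gene: str, genomes: Dict[str, Set[str]]) -> float:
    """Percentage of genomes in which the gene is present."""
    if not genomes:
        raise ValueError("at least one genome is required")
    present = sum(1 for genes in genomes.values() if gene in genes)
    return 100.0 * present / len(genomes)


# Toy presence data (made up for illustration, not real genomes).
genomes = {
    "taxon1": {"g1", "g2", "g3"},
    "taxon2": {"g1", "g2"},
    "taxon3": {"g1", "g3"},
    "taxon4": {"g1", "g2"},
}
persistent = sorted(
    g for g in {"g1", "g2", "g3"}
    if persistent_index(g, genomes) >= PI_THRESHOLD
)
print(persistent)
```

Here `g1` is present in all four toy genomes (PI = 100%), `g2` in three (75%), and `g3` in two (50%), so only `g1` and `g2` pass the threshold.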

All of these models proposed some explanations for the evolution of gene blocks in bacteria. Each of them had a specific hypothesis that worked exceptionally well in several cases, but not so well in others. Several studies suggested that there might be a model that involved multiple mechanisms. For example, a model that combined several models was able to lead to gene clustering under appropriate conditions (Ballouz et al., 2010). However, these models did not allow us to quantify the change between gene blocks. Hence, it is necessary to have a standard method to measure the evolution of gene blocks. The event-based model was proposed to allow us to quickly examine gene block evolution given a group of taxa and to compare the conservation of gene blocks in bacteria.

The event-based model explains the evolution of gene blocks that are homologous between different bacteria. There are several essential concepts in this model. The reference taxon is the species in which the operons have been experimentally verified. Neighboring genes are genes in a genome that are at most 500 nucleotides apart. Gene blocks are blocks of genes that contain at least one pair of neighboring genes. Orthoblocks are gene blocks that contain at least one pair of neighboring genes that are homologous to genes in a gene block in the reference taxon’s genome. An event is a change in the gene blocks between any two species with homologous gene blocks. In this model, evolution can be represented using 3 types of events: deletion events, duplication events, and split events. Given orthoblocks from two different taxa, their pairwise events can be defined as:

1. Splits: If two genes in one taxon are neighboring and their homologs in the other taxon are not, then there is a single split event.

2. Deletions: A gene exists in the gene block of one taxon, but its homolog cannot be found in the other taxon.

3. Duplications: If there is a gene g in a gene block in the source genome, and homologous genes g′, g″ together in a block of the target genome, then there is a duplication event. The duplication has to occur in a gene block to be tallied.
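The notions of neighboring genes and gene blocks used above can be sketched in a few lines. This is a minimal illustration under stated assumptions: genes are given as hypothetical `(name, start, end)` records on a single linear chromosome, the gap is measured from the end of one gene to the start of the next, and a gene block is read as a maximal run of at least two neighboring genes. The `Gene` type and the `gene_blocks` function are not part of the original software.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # coordinate of the first nucleotide
    end: int    # coordinate of the last nucleotide


MAX_GAP = 500  # neighboring threshold from the text, in nucleotides


def gene_blocks(genes: List[Gene], max_gap: int = MAX_GAP) -> List[List[str]]:
    """Group genes into maximal runs of neighbors; keep runs of >= 2 genes."""
    ordered = sorted(genes, key=lambda g: g.start)
    runs: List[List[Gene]] = []
    for gene in ordered:
        # Extend the current run if the gap to its last gene is small enough.
        if runs and gene.start - runs[-1][-1].end <= max_gap:
            runs[-1].append(gene)
        else:
            runs.append([gene])
    return [[g.name for g in run] for run in runs if len(run) >= 2]


# Toy coordinates (hypothetical): the 1100-nt gap between b and c is a split.
genome = [Gene("a", 0, 900), Gene("b", 1000, 1800),
          Gene("c", 2900, 3500), Gene("d", 3600, 4000)]
blocks = gene_blocks(genome)
print(blocks)
```

With these coordinates the genes fall into two blocks, `a b` and `c d`, because only the gap between `b` and `c` exceeds 500 nucleotides.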

An example that describes gene block evolution using the event-based model appears in Figure 2.1. The full list of pairwise events between all species is provided in Table 2.1. The orthoblocks from species A–E are arranged in a species phylogenetic tree in Figure 2.1.

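[Figure 2.1: orthoblocks of species A–E arranged on a species tree. Gene content per species: A: a b; B: a b c; C: a b c; D: a b b' c; E: a b b' c.]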

Figure 2.1: Species C has an experimentally determined operon (black arrows) and serves as the reference taxon. The orthologs in species A, B, D, and E were determined as explained in the text. The events between C and all other species for this orthoblock are A–C: deletion (of gene c); B–C: split (of gene c); C–D: duplication (of b) and split (jagged line); C–E: duplication (of b).

For split, duplication, and deletion events, the distance between any two orthologous gene blocks found in target organisms is defined as:

1. Split distance (d_s): the absolute difference in the number of relevant gene blocks between the two taxa. For example, consider the reference gene block with genes (abcdefg). Genome A has blocks ((ac),(bdefg)), and genome B has ((ac),(bde),(fg)). Therefore, d_s(A, B) = |2 − 3| = 1.

2. Deletion distance (d_d): the symmetric difference of the number of orthologous genes between the two organisms.

3. Duplication distance (d_u): the symmetric difference of the number of duplicated orthologous genes between two orthologous gene blocks, given that the orthologous gene is found in both organisms.

Table 2.1: Pairwise events for the orthoblocks shown A-E

    | A                            | B                     | C                  | D     | E
----+------------------------------+-----------------------+--------------------+-------+---
 A  |                              |                       |                    |       |
 B  | split, deletion              |                       |                    |       |
 C  | deletion                     | split                 |                    |       |
 D  | duplication, deletion, split | duplication, 2× split | duplication, split |       |
 E  | duplication, deletion        | duplication, split    | duplication        | split |

Using the event distances, we can compute all pairwise distances and normalize them for every pair of orthologous gene blocks from a group of species, for each event type. This enables us to compare the relative conservation of gene blocks in bacteria.
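The three event distances can be illustrated with a short sketch. It rests on assumptions that go beyond the text: each orthoblock is a string with one letter per gene, the deletion distance is read as the number of genes present in exactly one of the two taxa, and the duplication distance is read as the summed difference in copy number over genes present in both. The function names are hypothetical; only the split-distance example is taken from the text.

```python
from collections import Counter
from typing import List

Blocks = List[str]  # each string is one gene block, one letter per gene


def gene_counts(blocks: Blocks) -> Counter:
    """Copy number of every gene across all blocks of one taxon."""
    return Counter("".join(blocks))


def split_distance(a: Blocks, b: Blocks) -> int:
    """Absolute difference in the number of relevant gene blocks."""
    return abs(len(a) - len(b))


def deletion_distance(a: Blocks, b: Blocks) -> int:
    """Assumed reading: genes found in exactly one of the two taxa."""
    return len(set(gene_counts(a)) ^ set(gene_counts(b)))


def duplication_distance(a: Blocks, b: Blocks) -> int:
    """Assumed reading: copy-number difference over genes found in both taxa."""
    ca, cb = gene_counts(a), gene_counts(b)
    shared = set(ca) & set(cb)
    return sum(abs(ca[g] - cb[g]) for g in shared)


# Example from the text: reference block (abcdefg).
genome_a = ["ac", "bdefg"]
genome_b = ["ac", "bde", "fg"]
print(split_distance(genome_a, genome_b))  # |2 - 3| = 1, as in the text
```

Under these readings both genomes in the example contain every reference gene exactly once, so their deletion and duplication distances are zero and only the split distance is non-zero.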

2.2 Finding Orthologous Gene Blocks in Bacteria

In biology, the problem of finding conserved gene clusters has been an ongoing challenge. Since conserved gene clusters tend to share common functionality, detection of conserved gene clusters can help identify new operon functionality. Several methods and approaches have been proposed to define and search for orthologous gene blocks. One approach identifies conserved gene clusters using the concept of a “run”, which is a set of genes that occur on the same strand such that the gap between any pair of consecutive genes is at most 300 bp (Overbeek et al., 1999). This notion was a first step in connecting the biological knowledge with a mathematical formalization. However, the 300 bp distance is a rather strict constraint for finding gene clusters. Later on, a graph-based algorithm was proposed to identify FRECs (functionally related enzyme clusters) (Ogata et al., 2000). The main idea is to compare the order of genes in the genome and the cluster of gene products in the pathway. By using a single linkage clustering algorithm, the approach can extract a set of enzymes that are encoded closely in the genome and catalyze successive steps in the metabolic pathway. This method exploits the local similarities between the metabolic pathway and the genome sequence. Although the method can infer conservation of gene clusters, it cannot detect conserved gene clusters that do not encode successive reactions. Another popular approach is the gene team model (Luc et al., 2003). In this model, a team of genes of chromosomes A and B is a set of genes that appear in both A and B. Furthermore, the gap between each pair of consecutive genes must be at most a threshold θ. There are efficient algorithms to find gene teams in an arbitrary number of chromosomes for any threshold θ. This approach was later extended to handle spatial data to detect spatial gene clusters (Schulz et al., 2018). One drawback of this model is the strict gap constraint between any two consecutive genes. Secondly, the gene team model does not consider gene duplication, since a gene team is a set. On the other hand, the event-based model includes duplication events, as orthoblocks are multisets of genes, which will be elaborated in Chapter 4. In a sense, the event-based model generalizes the gene team model in terms of the gap constraint and duplication events.
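The gap condition of the gene team model can be checked directly from its definition. The sketch below is only such a check, not one of the efficient gene team algorithms mentioned above, and it does not test maximality of the set. It assumes each gene has a single integer position on each chromosome; the function names and the toy positions are hypothetical.

```python
from typing import Dict, Iterable


def is_chain(genes: Iterable[str], positions: Dict[str, int], theta: int) -> bool:
    """True if every gene occurs and consecutive members are <= theta apart."""
    members = list(genes)
    if any(g not in positions for g in members):
        return False
    coords = sorted(positions[g] for g in members)
    return all(nxt - cur <= theta for cur, nxt in zip(coords, coords[1:]))


def satisfies_team_condition(genes: Iterable[str],
                             chrom_a: Dict[str, int],
                             chrom_b: Dict[str, int],
                             theta: int) -> bool:
    """Gap condition of a gene team: the set is a chain on both chromosomes."""
    members = list(genes)
    return is_chain(members, chrom_a, theta) and is_chain(members, chrom_b, theta)


# Toy positions (hypothetical); gene order differs between the chromosomes.
chrom_a = {"a": 1, "b": 4, "c": 5, "d": 12}
chrom_b = {"c": 2, "a": 3, "b": 6, "d": 20}
print(satisfies_team_condition({"a", "b", "c"}, chrom_a, chrom_b, theta=3))
print(satisfies_team_condition({"a", "b", "c", "d"}, chrom_a, chrom_b, theta=3))
```

The set {a, b, c} passes on both chromosomes despite the different gene order, while adding `d` breaks the gap condition. Because membership is tracked as a set, a duplicated gene cannot be represented, which is the limitation noted above.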

Using the event-based model, we can identify orthologous gene blocks for a group of species. Previously, a heuristic approach was proposed to recover the maximum number of genes (minimize the deletion distance), prioritize gene blocks (minimize the split distance), and minimize the duplication distance for two given orthologous gene blocks (Ream et al., 2015). Unfortunately, that work did not formalize the problem. This encouraged us to define the problem mathematically and to study its complexity in Chapter 4.

2.3 Ancestral Reconstruction of Gene Blocks in Bacteria

Another use of the event-based model is the reconstruction of ancestral gene blocks along the evolutionary tree. Ancestral reconstruction is the tracing of ancestral features given the features of the individuals or populations. By recovering the state of the ancestral population, one can hypothesize the benefits of gaining or losing a feature, infer the environmental pressure that causes adaptations, predict future tendencies, etc. In biology, the reconstruction of ancestral sequences, including the amino acid sequences of proteins, has been studied extensively. The first formal study of ancestral reconstruction can be traced to (Pauling et al., 1963). Pauling and Zuckerkandl determined the common ancestral amino acid sequence of a group of homologous polypeptide chains using probability. In addition, the difference in amino acids at the same molecular site of two polypeptide chains could reveal which lineage carries the mutation. This study led to several later ancestral studies, such as the study of evolutionary divergence and convergence in proteins (Zuckerkandl and Pauling, 1965); the study of ancestral reconstruction models, namely maximum parsimony (Swofford and Maddison, 1987), maximum likelihood (Yang et al., 1995), and Bayesian inference (Yang, 2007); the ancestral states of the structure of folded proteins (Harms and Thornton, 2010); and the evaluation of ancestral protein reconstruction methods (Williams et al., 2006).

The maximum parsimony principle is the idea that one should choose the hypothesis with the fewest assumptions to make a prediction. In ancestral reconstruction, parsimony chooses the ancestral states within a given tree so that they minimize the total number of evolutionary changes needed to explain the observed population at the leaf nodes. Maximum parsimony is one of the earliest formal approaches to ancestral reconstruction. In character-based parsimony, each taxon is described by a set of characters. Each character can be in one of a finite number of states, and only specific changes are allowed for a set of characters in one step. There are several variants of the parsimony model. Fitch parsimony (standard parsimony) allows change between any two character states, and any change counts as just 1 step. Wagner parsimony only allows changes that happen in a particular order. Dollo parsimony assumes that a character is gained only once and can never be regained if it is lost (Farris, 1977). Overall, parsimony approaches are easier to implement and more efficient than maximum likelihood approaches.

However, this efficiency comes at a price: parsimony does not account for variation in rates of evolution. In addition, by favoring the minimum number of evolutionary changes, parsimony might produce an unrealistic course of evolution when changes happen often. Most importantly, it does not have a statistical model to estimate uncertainties.
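To make the parsimony criterion concrete, the following is a minimal sketch of the bottom-up pass of Fitch parsimony on a rooted binary tree. The function name `fitch_bottom_up`, the dictionary encoding of the tree, and the four-leaf example (leaves labeled A, A, T, G) are illustrative assumptions, not part of the methods developed in this thesis.

```python
def fitch_bottom_up(tree, leaf_states, root):
    """Bottom-up pass of Fitch parsimony on a rooted binary tree.

    tree maps each inner node to its two children; leaf_states maps each
    leaf to its observed state. Returns the candidate state set of every
    node and the minimum number of state changes on the tree.
    """
    state_sets = {}
    changes = 0

    def visit(node):
        nonlocal changes
        if node not in tree:  # leaf: its set is the observed state
            state_sets[node] = {leaf_states[node]}
            return state_sets[node]
        left, right = (visit(child) for child in tree[node])
        common = left & right
        if common:  # the children agree on at least one state
            state_sets[node] = common
        else:  # disagreement costs one change
            state_sets[node] = left | right
            changes += 1
        return state_sets[node]

    visit(root)
    return state_sets, changes


# Hypothetical four-leaf example: root x, inner nodes y and z.
tree = {"x": ["y", "z"], "y": ["m", "n"], "z": ["o", "p"]}
leaf_states = {"m": "A", "n": "A", "o": "T", "p": "G"}
state_sets, changes = fitch_bottom_up(tree, leaf_states, "x")
print(state_sets["y"], state_sets["z"], changes)
```

On this example the pass reports two changes: one below z (T versus G) and one at the root, where the state sets of y and z do not intersect. A full reconstruction would add a top-down pass that picks one state per inner node; that pass is omitted here.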

The goal of maximum likelihood is to find the best estimate of the unknown parameter values based on the observations. This statistical method allows us to measure the confidence of the ancestral reconstructions. In the simplest model of maximum likelihood, all characters have independent state transitions, and their rate of change is constant over time. In practice, the model can be extended to allow variation in mutation rates by adding more parameters. Given the data at the child nodes, the maximum likelihood model computes the conditional probabilities of different ancestral reconstructions and chooses the reconstruction that has the highest conditional probability (Yang et al., 1995). This is best explained using Figure 2.2. Given a phylogenetic tree rooted at node x, the observed data at the leaf nodes X = {m : A, n : A, o : T, p : G}, and parameters θ containing the mutation rates and the branch lengths, we want to compute the likelihood of the observed data. For each inner node, there are 4 possible labels (A, G, T, C). Therefore, there are 4^3 = 64 possible assignments in total. Consider the assignment in Figure 2.2b with x : A, y : A, z : T. We can compute its conditional probability as:

\[
\begin{aligned}
& P(x{:}A) \times P(y{:}A \mid x{:}A, t_{xy}) \times P(z{:}T \mid x{:}A, t_{xz}) \times P(m{:}A \mid y{:}A, t_{ym}) \\
& \times P(n{:}A \mid y{:}A, t_{yn}) \times P(o{:}T \mid z{:}T, t_{zo}) \times P(p{:}G \mid z{:}T, t_{zp})
\end{aligned}
\tag{2.1}
\]

To calculate the likelihood of the observed data, we have to sum over all possible assignments. The objective of ancestral reconstruction under the maximum likelihood model is to find the assignment to all internal nodes that maximizes the likelihood of the observed data for a given tree. One approach is to greedily maximize the likelihood between an ancestral node and its immediate descendants as we traverse the tree from leaf to root. This simplifies the computation; however, the resulting likelihood might be worse than the global optimum. The second approach is to consider the joint combination, as calculated in Equation 2.1, to find the optimal solution. Clearly, the latter approach is more time-consuming. Fortunately, more efficient algorithms that tackle the joint likelihood have been developed (Pupko et al., 2000). Overall, maximum likelihood methods have greater accuracy than maximum parsimony methods when variation in rates of evolution is the norm. However, maximum likelihood requires a transition matrix between the character states, which is not readily available for the event-based model.
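The product in Equation 2.1 and the sum over all 64 inner-node assignments can be sketched as follows. The text does not fix a substitution model, so this sketch assumes the Jukes-Cantor (JC69) model with a uniform prior at the root and equal branch lengths of 0.1; these choices, and all function names, are assumptions made only for illustration.

```python
import itertools
import math

BASES = "ACGT"


def jc69(parent, child, t):
    """JC69 probability of observing `child` below `parent` after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if parent == child else 0.25 - 0.25 * decay


def assignment_probability(x, y, z, leaves, t):
    """The product of Equation 2.1 for inner-node labels x, y, z."""
    p = 0.25  # P(x): uniform root prior (assumption)
    p *= jc69(x, y, t["xy"]) * jc69(x, z, t["xz"])
    p *= jc69(y, leaves["m"], t["ym"]) * jc69(y, leaves["n"], t["yn"])
    p *= jc69(z, leaves["o"], t["zo"]) * jc69(z, leaves["p"], t["zp"])
    return p


def likelihood(leaves, t):
    """Likelihood of the leaf data: sum over all 4^3 = 64 assignments."""
    return sum(
        assignment_probability(x, y, z, leaves, t)
        for x, y, z in itertools.product(BASES, repeat=3)
    )


# Leaf data of Figure 2.2; the branch lengths are hypothetical.
leaves = {"m": "A", "n": "A", "o": "T", "p": "G"}
branch_lengths = {name: 0.1 for name in ("xy", "xz", "ym", "yn", "zo", "zp")}
print(assignment_probability("A", "A", "T", leaves, branch_lengths))
print(likelihood(leaves, branch_lengths))
```

Enumerating all assignments is exponential in the number of inner nodes, which is why the joint approach is more expensive than the greedy one; the enumeration here is only meant to mirror the definition.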

[Figure 2.2, panel (a): tree rooted at x with unlabeled inner nodes x, y, z; panel (b): the same tree with the assignment x:A, y:A, z:T. In both panels the branch lengths are t_xy, t_xz, t_ym, t_yn, t_zo, t_zp and the leaves are m:A, n:A, o:T, p:G.]

Figure 2.2: Given a tree rooted at node x whose leaf nodes are labeled with nucleotides, the reconstruction problem finds a nucleotide assignment for each inner node. Figure 2.2b is an example of such an assignment.

Because of the simplicity of the maximum parsimony principle, we use it to explore the evolution of gene blocks in bacteria; however, the problems that we define under maximum parsimony assumptions are already hard. One problem is to minimize the evolution cost between every pair of children nodes and parent node, and the other is to minimize the sum of all evolution costs over the phylogenetic tree. We then apply these methods to several operons from two reference species, E. coli and B. subtilis. The results provide insights into the evolution of gene blocks and help us understand the functional intermediate states leading to the full operon.

CHAPTER 3. PRELIMINARIES

First, we introduce some important definitions of orthologous gene blocks. This helps formalize the evolutionary cost using the event-based model. Second, we present some typical phylogenetic tree definitions to define our labeling function, and some notions to help with the ancestral reconstruction algorithms that will be presented later in the thesis.

3.1 Event-based model

A reference taxon is a taxon where operons have been reliably identified by experimental means. Such a taxon serves as a standard of truth to determine if the genes on a suspected orthoblock reside in an operon or a similar co-regulated gene block in at least one species.

An event is a single change in an orthoblock that is characterized as a split, deletion, or duplication. Fig. 2.1 depicts an example of such events. The event-based cost between any two orthoblocks is then defined to be the minimum possible number of splits, duplications, and deletions required to explain the difference between them.

Given a reference operon O, we define $\mathcal{G} := \{x_1, x_2, x_3, \ldots, x_n\}$ to be the set of genes of O. A gene block B over $\mathcal{G}$ is a non-empty multiset over $\mathcal{G}$ defined as $B := \{x_1^{\lambda_1}, x_2^{\lambda_2}, \ldots, x_n^{\lambda_n}\}$ where $x_i \in \mathcal{G}$ and $\lambda_i \in \mathbb{N}$. We define the set of genes in gene block B as $Gene(B) := \{x_i \mid \lambda_i \geq 1\}$, the duplication gene set of B as $Dup(B) := \{x_i \mid \lambda_i \geq 2\}$, and the size of B as $Size(B) := \sum_i \lambda_i$.

An orthoblock O is a set of gene blocks that is either empty or contains at least one gene block of size larger than or equal to two. We define the set of genes of O as $Gene(O) := \bigcup_{B \in O} Gene(B)$, and the set of genes that are duplicated in some gene block of O as $Dup(O) := \bigcup_{B \in O} Dup(B)$. Given a gene block B and a gene set G over $\mathcal{G}$, we define $B \cap G := \{x_i^{\lambda_i} \in B \mid x_i \in G\}$. Costs for orthoblocks are described as pairwise functions that calculate the event count from one orthoblock to another, as a proxy for their evolutionary distance. The costs between any two orthoblocks O and O' are defined as follows.

1. The split cost, denoted as $c_s$, is the absolute difference in the number of relevant gene blocks between the two taxa involved. We define Rel(O, O') as the set of gene blocks from O in which each gene of each gene block has to appear in O' at least once. Formally, $Rel(O, O') := \bigcup_{B \in O} (B \cap Gene(O'))$. The split cost can be formalized as:

\begin{align}
c_s(O, O') &:= \big|\, |Rel(O, O')| - |Rel(O', O)| \,\big| \tag{3.1} \\
&= \Big|\, \Big|\bigcup_{B \in O} (B \cap Gene(O'))\Big| - \Big|\bigcup_{B \in O'} (B \cap Gene(O))\Big| \,\Big| \tag{3.2}
\end{align}

If one of the two orthoblocks is the reference gene block R, the split cost between R and an orthoblock O can be simplified as follows, since R consists of a single gene block and Gene(O) ⊆ Gene(R).

\begin{align}
c_s(R, O) &= \big|\, |Rel(R, O)| - |Rel(O, R)| \,\big| \tag{3.3} \\
&= \Big|\, \Big|\bigcup_{B \in R} (B \cap Gene(O))\Big| - \Big|\bigcup_{B \in O} (B \cap Gene(R))\Big| \,\Big| \tag{3.4} \\
&= \Big|\, \Big|\bigcup_{B \in R} (B \cap Gene(O))\Big| - \Big|\bigcup_{B \in O} (B \cap Gene(O))\Big| \,\Big| \tag{3.5} \\
&= \Big|\, 1 - \Big|\bigcup_{B \in O} B\Big| \,\Big| = \big|\, 1 - |O| \,\big| = |O| - 1 \tag{3.6}
\end{align}

For example, for the reference gene block R = (abcdefg), genome A has blocks O = ((ab), (def)). We then compute the relevant gene blocks Rel(R, O) = (abdef) (removing genes c, g) and Rel(O, R) = ((ab), (def)). Therefore, $c_s(R, O) = |1 - 2| = 1$.

2. The duplication cost, denoted as $c_u$, is the pairwise count of gene duplications between two orthoblocks. We define Dif(O, O') to be the set of duplicated genes of orthoblock O such that these genes also appear in O' but are not duplicated in O'. Formally, $Dif(O, O') := (Dup(O) \cap Gene(O')) \setminus Dup(O')$. Here, our gene blocks are guaranteed to have at most one duplication of each gene for each block. We formalize the duplication cost as follows.

\begin{align}
c_u(O, O') &:= |Dif(O, O')| + |Dif(O', O)| \tag{3.7} \\
&= \big|(Dup(O) \cap Gene(O')) \setminus Dup(O')\big| \nonumber \\
&\quad + \big|(Dup(O') \cap Gene(O)) \setminus Dup(O)\big| \tag{3.8}
\end{align}

If one of the two orthoblocks is the reference gene block R, we can simplify the duplication cost between R and an orthoblock O as follows.

\begin{align}
c_u(R, O) &= |Dif(R, O)| + |Dif(O, R)| \tag{3.9} \\
&= \big|(Dup(R) \cap Gene(O)) \setminus Dup(O)\big| \nonumber \\
&\quad + \big|(Dup(O) \cap Gene(R)) \setminus Dup(R)\big| \tag{3.10} \\
&= |\emptyset| + |Dup(O)| = |Dup(O)| \tag{3.11}
\end{align}

For example, for a reference gene block R = (abcde), the genome A has the gene block O = (abbcc). Observe that the orthologs of genes b and c are duplicated in genome A. We then compute Dif(R, O) = ∅ and Dif(O, R) = {b, c}. Therefore, $c_u(R, O) = 0 + 2 = 2$.

3. The deletion cost, denoted as $c_d$, is the number of orthologs that are in the orthoblocks of the genome of one organism or the other, but not in both. In other words, the deletion cost is the size of the symmetric difference between the sets of orthologous genes of the two orthoblocks O, O'. We formalize the deletion cost as:

\[
c_d(O, O') := |Gene(O) \,\triangle\, Gene(O')|
\]

If one of the two orthoblocks is the reference gene block R, we can simplify the deletion cost between R and an orthoblock O:

\begin{align}
c_d(R, O) &= |Gene(R) \,\triangle\, Gene(O)| \tag{3.12} \\
&= |Gene(R)| - |Gene(O)| \qquad \text{since } Gene(O) \subseteq Gene(R) \tag{3.13}
\end{align}

For example, for a reference gene block R = (abcde), the genome A has gene block O = (abd), and the deletion cost between orthoblocks R, O is $c_d(R, O) = |\{a, b, c, d, e\} \,\triangle\, \{a, b, d\}| = 5 - 3 = 2$.
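The three costs above can be sketched in a few lines of code. This is a minimal illustration, assuming that a gene block is encoded as a `collections.Counter` over single-letter gene names, that an orthoblock is a list of such blocks, and that blocks which become empty after restriction are not counted as relevant; the function names are hypothetical.

```python
from collections import Counter


def orthoblock(*blocks):
    """Build an orthoblock from strings such as "abbcc" (one string per gene block)."""
    return [Counter(block) for block in blocks]


def gene_set(ortho):
    """Gene(O): genes that occur in at least one block."""
    return {g for block in ortho for g, count in block.items() if count >= 1}


def dup_set(ortho):
    """Dup(O): genes with multiplicity of at least two in some block."""
    return {g for block in ortho for g, count in block.items() if count >= 2}


def relevant_blocks(ortho, other):
    """Rel(O, O'): blocks of O restricted to Gene(O'), dropping empty blocks."""
    keep = gene_set(other)
    restricted = [
        Counter({g: count for g, count in block.items() if g in keep})
        for block in ortho
    ]
    return [block for block in restricted if block]


def split_cost(a, b):
    return abs(len(relevant_blocks(a, b)) - len(relevant_blocks(b, a)))


def dup_cost(a, b):
    dif_ab = (dup_set(a) & gene_set(b)) - dup_set(b)
    dif_ba = (dup_set(b) & gene_set(a)) - dup_set(a)
    return len(dif_ab) + len(dif_ba)


def del_cost(a, b):
    return len(gene_set(a) ^ gene_set(b))  # size of the symmetric difference


# The three worked examples from the text.
print(split_cost(orthoblock("abcdefg"), orthoblock("ab", "def")))
print(dup_cost(orthoblock("abcde"), orthoblock("abbcc")))
print(del_cost(orthoblock("abcde"), orthoblock("abd")))
```

Run on the worked examples, the sketch reproduces the costs computed in the text: a split cost of 1, a duplication cost of 2, and a deletion cost of 2.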

3.2 Phylogenetic tree

Let T be a tree with vertex set V(T), edge set E(T), and leaf set L(T), and let $\mathcal{G}$ be the set of genes of a reference operon. We define Ω as the set of all possible orthoblocks over the gene set $\mathcal{G}$. Let $\lambda : L(T) \to \Omega$ be a labeling of L(T), which assigns orthoblocks from Ω to the leaf nodes of T (this can include empty orthoblocks). We define the function $\hat{\lambda} : V(T) \to \Omega$ to be an extension of λ on T if it coincides with λ on the leaves of T (it assigns an orthoblock to each node of T). If $\hat{\lambda}(v) = O$, we say that vertex v is labelled with orthoblock O. Given a labeling $\hat{\lambda}$ and an edge (u, v) ∈ E(T), we define the cost between the labels of the endpoints u, v as $c(u, v) := c(\hat{\lambda}(u), \hat{\lambda}(v))$ and the total cost function as $c(\hat{\lambda}) := \sum_{(u,v) \in E(T)} c(u, v)$. Furthermore, given a species tree T and a reference operon, for a node v ∈ V(T) with orthoblock $O := \hat{\lambda}(v)$, we define:

1. $I_g(v)$: the indicator function of gene g in O, that is, $I_g(v) = 1$ if g is in O and $I_g(v) = 0$ otherwise.

2. v.gene[g]: the set that represents whether to include gene g in O. There are only 3 possible cases.

(a) v.gene[g] = {1}: gene g has to be in O.

(b) v.gene[g] = {0}: gene g cannot be in O.

(c) v.gene[g] = {0, 1}: gene g may or may not be in O.

3. v.dup[g]: the set that represents the duplication status of gene g in O. There are only 3 possible cases.

(a) v.dup[g] = {1}: gene g has to be duplicated in O.

(b) v.dup[g] = {0}: gene g cannot be duplicated in O.

(c) v.dup[g] = {0, 1}: gene g may or may not be duplicated in O.

4. $FREQ_g(v)$: the proportion of $L(T_v)$, the leaves of the subtree $T_v$ rooted at v, that contain gene g: $FREQ_g(v) = \frac{|\{u \mid u \in L(T_v) \wedge I_g(u) = 1\}|}{|L(T_v)|}$.

5. $DUP_g(v)$: the proportion of $L(T_v)$ that contain a duplication of gene g.
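The two proportions at the end of the list above can be computed with a simple traversal. The sketch below is only illustrative: it assumes that the tree is given as a dictionary from inner nodes to children, that each leaf label is a list of gene blocks written as strings, and that a gene counts as duplicated at a leaf if some block contains it at least twice. The tree, the labels, and the function names are all hypothetical.

```python
def leaves_below(tree, v):
    """L(T_v): the leaves of the subtree rooted at v."""
    children = tree.get(v, [])
    if not children:
        return [v]
    return [leaf for child in children for leaf in leaves_below(tree, child)]


def freq(tree, labels, v, g):
    """FREQ_g(v): fraction of leaves below v whose orthoblock contains gene g."""
    leaves = leaves_below(tree, v)
    hits = sum(1 for u in leaves if any(g in block for block in labels[u]))
    return hits / len(leaves)


def dup(tree, labels, v, g):
    """DUP_g(v): fraction of leaves below v in which gene g is duplicated."""
    leaves = leaves_below(tree, v)
    hits = sum(1 for u in leaves if any(block.count(g) >= 2 for block in labels[u]))
    return hits / len(leaves)


# Hypothetical species tree with four leaves; leaf p carries an empty orthoblock.
tree = {"x": ["y", "z"], "y": ["m", "n"], "z": ["o", "p"]}
labels = {"m": ["abc"], "n": ["ab"], "o": ["abbc"], "p": []}
print(freq(tree, labels, "x", "c"), dup(tree, labels, "z", "b"))
```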

CHAPTER 4. FINDING ORTHOLOGOUS GENE BLOCKS IN BACTERIA: THE COMPUTATIONAL HARDNESS OF THE PROBLEM AND NOVEL METHODS TO ADDRESS IT

In bacteria and archaea, gene blocks are sets of genes co-located on the chromosome, which are typically conserved, to some extent, between species. Operons can be viewed as a special case of gene blocks where genes are co-transcribed to polycistronic mRNA and are often associated with related functions, molecular complexes, or both. Such conserved gene blocks have been used in gene function prediction and phylogenetic analyses (Enault et al., 2003; Overbeek et al., 1999; Srinivasan et al., 2005). Several interesting biological questions could be answered by studying the evolution of gene blocks: Which components of the gene block tend to be more conserved? How did the gene block evolve? Given a gene block in a reference genome, which gene blocks from other taxa are orthologous to it?

Annotating sequences of bacterial genomes and determining whether gene blocks are orthologous is essential for our understanding of bacterial genomics and evolution. Recently, we developed a heuristic method to identify novel orthologous gene blocks (Ream et al., 2015). Motivated by this approach, we formalized the biological problem of finding orthologous gene blocks, which we named the Relevant Gene Block problem. In this study, we showed that the Relevant Gene Block problem is NP-hard. We then described a greedy approximation method and assessed its accuracy. We also formulated an Integer Linear Program for an exact solution, which is suitable for conducting smaller studies and can serve as a baseline for evaluating the greedy method. We demonstrated the exceptional scalability and accuracy of our greedy method through extensive comparative empirical studies. From this point on, we refer to the method in (Ream et al., 2015) as the Heuristic method, to our method as the Greedy method, and to the Integer Linear Programming formulation as the ILP method.

In this chapter, we use the definitions in Chapter 3 to define the Relevant Gene Block (RGB) problem. Given a reference gene block and a set of homologous gene blocks in a target genome, this problem seeks orthologous gene blocks. We prove that this problem is inherently difficult, that is, NP-hard. Furthermore, we show that the RGB problem is unlikely to be efficiently approximated with a ratio bounded by a constant, meaning the problem is APX-hard. Despite these discouraging time complexity results, we describe an efficient greedy algorithm with an O(ln n) approximation for the RGB problem. In addition, we provide an ILP formulation of the RGB problem suitable for smaller gene-block studies. Finally, in a comparative study of empirical data, we demonstrate the outstanding performance of our proposed methods in terms of scalability and optimization.

4.1 The Relevant-Gene-Block problem

Under the assumption that the reference gene block is the true operon, our problem is to identify the orthoblocks in our target genome so that the overall cost for the three events is minimized. Observe that this problem might be challenging, since the event costs are not independent of each other: they all depend on the set of genes of the orthoblock. First, we describe a mathematical formalization of the problem, which we refer to as the Relevant-Gene-Block (Deletion Duplication Split) problem. Then we formulate restricted variations of this problem relevant for our time complexity analyses, which involve a reduction from the minimum set-cover problem.

Problem 1. Relevant-Gene-Block (deletion, duplication, split)

Instance: ⟨R, O⟩, the reference gene block and the set of homologous gene blocks in the target genome.

Find: ⟨O'⟩ where O' ⊆ O, so that c(R, O') := c_d(R, O') + c_s(R, O') + c_u(R, O') is minimum.

Problem 2. Relevant-Gene-Block (deletion, split)

Instance: ⟨R, O⟩, the reference gene block and the cluster of homologous genes in the target genome.

Find: ⟨O'⟩ where O' ⊆ O, so that c(R, O') := c_d(R, O') + c_s(R, O') = |Gene(R)| − |Gene(O')| + |O'| − 1 is minimum.

Since Gene(R) is simply the gene set of R, and 1 is a constant, we can reduce this problem further into its final form suitable for our analysis.

Problem 3. Relevant-Gene-Block (deletion, split, simplified)

Instance: ⟨S, O⟩, the set of genes of our reference block and a collection of subsets of the reference gene set, respectively.

Find: ⟨O'⟩ where O' ⊆ O, so that f(S, O') := |S| − |∪O'| + |O'| is minimum.

We state the minimum set cover problem required for our reduction.

Problem 4. Minimum-Set-Cover

Instance: ⟨S, O⟩, the set of genes of our reference block and a collection of subsets of the reference gene set, respectively.

Find: ⟨O'⟩ where O' ⊆ O, so that ∪O' covers S and |O'| is minimum.

From now on, we use MSC to stand for the Minimum-Set-Cover problem, and RGB for the Relevant-Gene-Block (deletion, split, simplified) problem.
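To make the objective of the simplified problem concrete, the sketch below evaluates f(S, O') = |S| − |∪O'| + |O'| on a small instance and finds optimal solutions of RGB and MSC by exhaustive search. The instance and the function names are hypothetical, and the exhaustive search is exponential in the number of blocks; it is meant only to illustrate the definitions.

```python
from itertools import combinations


def f(S, chosen):
    """RGB objective: uncovered genes plus the number of chosen blocks."""
    covered = set().union(*chosen)
    return len(S) - len(covered) + len(chosen)


def brute_force_rgb(S, C):
    """Return (cost, blocks) minimizing f over every sub-collection of C."""
    best = None
    for k in range(len(C) + 1):
        for chosen in combinations(C, k):
            cost = f(S, chosen)
            if best is None or cost < best[0]:
                best = (cost, list(chosen))
    return best


def brute_force_msc(S, C):
    """Return a smallest sub-collection of C that covers S, or None."""
    for k in range(len(C) + 1):
        for chosen in combinations(C, k):
            if set().union(*chosen) >= S:
                return list(chosen)
    return None


# Hypothetical instance in which the blocks together cover S.
S = set("abcdef")
C = [frozenset("abc"), frozenset("cd"), frozenset("def"), frozenset("a")]
rgb_cost, rgb_blocks = brute_force_rgb(S, C)
msc_blocks = brute_force_msc(S, C)
print(rgb_cost, len(msc_blocks))
```

On this instance the optimal RGB cost equals the size of a minimum set cover, which is the relationship the reduction in the next section relies on when ∪C = S.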

4.2 The RGB problem is Hard

First, we prove the NP-hardness of the RGB problem by a reduction from the MSC problem. Our reduction relies on the essential property that the solution of the RGB problem should be a minimum set cover itself. Using this property, we introduce three lemmas to show the reduction. Second, we prove that our problem is equivalent in cost to the MSC problem and describe a greedy algorithm to provide an O(ln n) approximation of the RGB problem. Finally, we describe an ILP formulation of the RGB problem.

For the reduction, we take an instance of the MSC problem and reduce it to an RGB instance that has a solution if and only if the MSC instance has a solution. Given an instance ⟨S, C⟩ of MSC, where S is the set of ground elements and C is a collection of subsets of S, we construct the same instance ⟨S, C⟩ for the RGB problem, where S is the set of genes and C is a collection of subsets of the reference gene set. Observe that ∪C = S.

Lemma 1. If the set C' is a solution to our RGB instance ⟨S, C⟩, then C' is a minimum set cover of the set ∪C'.

Proof. Assume the contrary: ∃C* ⊆ C with ∪C* = ∪C' and |C*| < |C'|. We then have the equality f(S, C*) = |S| − |∪C*| + |C*| = |S| − |∪C'| + |C*|. Since |C*| < |C'|, f(S, C*) < |S| − |∪C'| + |C'| = f(S, C'). Hence, f(S, C*) < f(S, C'). This contradicts the assumption that C' is a solution to our RGB instance. Therefore, C' is a minimum set cover of the set ∪C'.

Lemma 2. If the set C' is a solution to our RGB instance ⟨S, C⟩, then ∀c ∈ (C \ C'), |c \ ∪C'| ≤ 1.

Proof. Assume the contrary: ∃c* ∈ (C \ C') with |c* \ ∪C'| ≥ 2. Consider C* = C' ∪ {c*}. We have:

f(S, C*) = |S| − |∪C*| + |C*|

= |S| − |∪(C' ∪ {c*})| + |C' ∪ {c*}|

= |S| − |∪C'| − |c* \ ∪C'| + |C'| + 1

≤ |S| − |∪C'| − 2 + |C'| + 1    (since |c* \ ∪C'| ≥ 2)

= |S| − |∪C'| + |C'| − 1

= f(S, C') − 1

Hence, f(S, C*) < f(S, C'). This contradicts the assumption that C' is a solution to our RGB instance. Therefore, ∀c ∈ (C \ C'), |c \ ∪C'| ≤ 1.

Lemma 3. If the set C' is a solution to our RGB instance ⟨S, C⟩, then for every c ∈ (C \ C') with |c \ ∪C'| = 1, C1 := C' ∪ {c} is a minimum set cover of ∪(C' ∪ {c}).

Proof. The proof is straightforward. As long as we add a subset that contributes exactly one new element to our cover so far, we increase the number of blocks by 1 and decrease the number of elements left to cover by 1. Therefore, f(S, C1) does not change and is still the minimum.

Algorithm 1: Constructing a solution of MSC given a solution of RGB
Input: C: a collection of subsets, C': a solution of RGB
Output: C1: a solution of MSC
1 C ← C \ C'
2 C1 ← C'
3 for c ∈ C do
4     if |c \ ∪C1| = 1 then
5         C1 ← C1 ∪ {c}
6 return C1

Claim 1. If the set C' is a solution to our RGB instance ⟨S, C⟩, then we can construct a set C1 that is a solution to the MSC instance ⟨S, C⟩ using Algorithm 1 in polynomial time.

Proof. Since C' is a solution to the RGB instance ⟨S, C⟩, by Lemma 1, C' is a minimum set cover of the set ∪C'. By Lemma 2, any subset c in the collection C but not in C' can contribute at most one more element to ∪C'. Hence, if we include all of the subsets c that contribute one extra element to C1, as in the for loop of Algorithm 1, we eventually cover all of the elements in the set S. By Lemma 3, after including each such subset c, C1 maintains the invariant that C1 is a minimum set cover of ∪C1. Therefore, C1 is a minimum set cover of S, and C1 is a solution for the MSC instance ⟨S, C⟩. Algorithm 1 runs in polynomial time, as the for loop terminates after at most |C| iterations. This proves the claim. In addition, the solution to RGB has the same cost as the solution to MSC.
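The construction in Algorithm 1 can be sketched as follows. The sketch assumes that blocks are `frozenset` objects and that the given collection really is an optimal RGB solution, as Claim 1 requires; the instance and the function name `extend_to_cover` are hypothetical.

```python
def extend_to_cover(C, rgb_solution):
    """Extend an optimal RGB solution to a set cover, following Algorithm 1.

    A block outside the solution is added whenever it contributes exactly
    one gene that is not covered yet.
    """
    cover = list(rgb_solution)
    covered = set().union(*cover)
    for c in C:
        if c in cover:
            continue
        if len(c - covered) == 1:
            cover.append(c)
            covered |= c
    return cover


def f(S, chosen):
    """RGB objective f(S, C') = |S| - |union of C'| + |C'|."""
    return len(S) - len(set().union(*chosen)) + len(chosen)


# Hypothetical instance: {abc} alone is an optimal RGB solution with cost 2.
S = set("abcd")
C = [frozenset("abc"), frozenset("cd"), frozenset("d")]
rgb_solution = [frozenset("abc")]
cover = extend_to_cover(C, rgb_solution)
print(cover, f(S, rgb_solution), f(S, cover))
```

In this example the block {c, d} adds exactly one new gene and is included, after which {d} adds nothing; the cost f is the same before and after the extension, as Lemma 3 states.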

Claim 2. If the set C' is a solution to our MSC instance ⟨S, C⟩, then C' is also a solution to the RGB instance ⟨S, C⟩.

Proof. Since C' is a solution to our MSC instance, we have f(S, C') = |S| − |∪C'| + |C'| = 0 + |C'| = |C'|.

Assume the contrary: ∃C* ⊆ C such that C* is a solution to the RGB instance and f(S, C*) < f(S, C'). We can use Algorithm 1 to construct C1 that is a set cover of S with f(S, C1) = f(S, C*) (from Lemmas 1, 2, and 3). This provides the following property: f(S, C1) = |S| − |∪C1| + |C1| = |S| − |S| + |C1| = |C1| = f(S, C*). Together with f(S, C') = |C'| and f(S, C*) < f(S, C'), we have f(S, C*) < |C'|. Since we have just shown that f(S, C*) = |C1|, this means that |C1| < |C'| and C1 is a set cover of S. Hence, we can construct a set cover C1 that is smaller than C'. This contradicts our assumption that C' is a solution to our MSC instance (i.e., C' is a minimum set cover of S). By contradiction, C' is also a solution to the RGB instance ⟨S, C⟩. Furthermore, the solution to MSC has the same cost as the solution to RGB.

Both the NP-hardness and the APX-hardness of the RGB problem follow from the previous claims.

Theorem 1. The RGB problem is NP-hard.

Proof. From Claim 1 and Claim 2, we conclude that the MSC problem reduces in polynomial time to the RGB problem. Since the MSC problem is NP-hard, it follows that the RGB problem is NP-hard.

Theorem 2. The RGB problem is APX-hard: it is hard to approximate within 0.72 ln n.

Proof. We observe that a solution of MSC and a solution of RGB are of equal cost. Therefore, our reduction is also an approximation-preserving reduction. Hence, RGB is hard to approximate within 0.72 ln n.

4.3 Solving the RGB problem

With the NP-hardness result in hand, we do not expect an efficient algorithm that solves the problem exactly. We also showed that the problem is APX-hard; hence, RGB is unlikely to have an efficient approximation algorithm with a constant ratio. However, our reduction shows that a solution of the MSC problem has the same cost as a solution of RGB. Therefore, we devise a greedy algorithm for the RGB problem as follows.

4.3.1 Greedy Algorithm

Algorithm 2: Greedy Algorithm to solve the RGB problem
Input: S: gene set of the reference gene block, C: set of orthologous gene blocks
Output: C': solution of RGB
1 S' ← ∪C
2 C' ← {}
3 while S' ≠ ∅ do
4     Select c ∈ C that maximizes |c ∩ S'|
5     S' ← S' \ c
6     C' ← C' ∪ {c}
7 return C'

We will show that our algorithm is a ln n approximation of our RGB problem. In order to do so, we need the following lemma.

Lemma 4. Let C^opt be the optimal result of our RGB instance ⟨S, C⟩, and let C* be the optimal result of the MSC instance ⟨S', C⟩, with S' = ∪C as in Algorithm 2. Then:

f(S, C^opt) = |S \ ∪C| + f(S', C*)

Proof. By Lemma 1, C^opt is a minimum set cover of ∪C^opt. Then, using Algorithm 1, we can build a set C1 that is a minimum set cover of S' = ∪C such that f(S, C^opt) = f(S, C1). Hence, C1 is also an optimal result of the MSC instance ⟨S', C⟩, which means |C1| = |C*|. We expand the equality f(S, C^opt) = f(S, C1) as follows:

f(S, C^opt) = f(S, C1)

= |S| − |∪C1| + |C1|

= |S| − |∪C| + |C1|    (C1 is a set cover of ∪C)

= |S \ ∪C| + |C1|

= |S \ ∪C| + |S'| − |∪C*| + |C1|    (C* covers S')

= |S \ ∪C| + |S'| − |∪C*| + |C*|    (|C1| = |C*| as shown above)

= |S \ ∪C| + f(S', C*)    (f(S', C1) = f(S', C*))

Therefore, the proof is concluded.

Theorem 3. The greedy algorithm runs in polynomial time and provides an O(ln n)-approximation of our RGB problem, where n is the cardinality of the set S.

Proof. First, we use the result of (Chvatal, 1979). Given that C* is the optimal solution to the MSC instance ⟨S', C⟩, and C' is the output of our Algorithm 2, we have:

f(S', C') ≤ (1 + ln n) f(S', C*)

⇔ f(S', C') ≤ (1 + ln n)(f(S, C^opt) − |S \ ∪C|)    (from Lemma 4)

⇔ f(S', C') ≤ (1 + ln n)(f(S, C^opt) − δ)    (we can treat δ := |S \ ∪C| as a constant)

⇒ f(S', C') ≤ (1 + ln n) f(S, C^opt)    (δ ≥ 0)

Let |S| = n and |C| = m. Lines 1 and 2 of the algorithm take time linear in n. In each iteration of the loop, we go through the set of orthologous gene blocks to find the best gene block. Hence, the while loop takes O(mn). The claimed statement follows.
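The greedy procedure of Algorithm 2 can be sketched as below. This is a minimal illustration, assuming blocks are `frozenset` objects; the instance and the function name `greedy_rgb` are hypothetical, and ties between equally good blocks are broken by list order, a choice that the pseudocode leaves open.

```python
def greedy_rgb(C):
    """Greedy cover of the genes that occur in C, following Algorithm 2.

    The reference gene set S is not needed to run the loop; it only enters
    the cost f(S, C') of the returned solution.
    """
    uncovered = set().union(*C)  # S' in Algorithm 2
    chosen = []                  # C' in Algorithm 2
    while uncovered:
        # Pick the block that covers the most genes that are still uncovered.
        best = max(C, key=lambda c: len(c & uncovered))
        chosen.append(best)
        uncovered -= best
    return chosen


def f(S, chosen):
    """RGB objective f(S, C') = |S| - |union of C'| + |C'|."""
    return len(S) - len(set().union(*chosen)) + len(chosen)


# Hypothetical instance; gene g of the reference block occurs in no block.
S = set("abcdefg")
C = [frozenset("abc"), frozenset("cd"), frozenset("def"), frozenset("a")]
solution = greedy_rgb(C)
print(solution, f(S, solution))
```

The loop always terminates because every uncovered gene lies in some block of C, so the selected block removes at least one gene from the uncovered set in each iteration.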

4.3.2 Integer Linear Programming

Given an instance ⟨S, C⟩ of the RGB problem, we define a variable x_i for every set c_i ∈ C that takes a value of 1 if we include c_i in our answer, and a value of 0 otherwise. We can express our RGB problem as the following integer linear program:

\begin{align}
\text{minimize} \quad & \sum_{i=1}^{|C|} x_i \tag{4.1} \\
\text{subject to} \quad & \sum_{i \,:\, g \in c_i} x_i \geq 1 \qquad \forall g \in \cup C \tag{4.2} \\
& x_i \in \{0, 1\} \qquad \forall c_i \in C \tag{4.3}
\end{align}

The size of this ILP is the same as the size of the ILP for the typical MSC problem, which is less than or equal to |C| + |S| · |C|.

Theorem 4. Let C^opt be the optimal result of our RGB instance, and let C* be the optimal result of the ILP formulation above. Then:

f(S, C^opt) = |S \ ∪C| + f(S', C*)

Proof. Since C* is an optimal solution for the ILP, it is also a solution to the instance ⟨S', C⟩ of MSC. Using Lemma 4, we can conclude the proof.
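The identity in Theorem 4 can be checked numerically on a small instance. The sketch below does not call an ILP solver; as a stand-in, it enumerates every 0/1 vector x and keeps the cheapest one that satisfies constraint (4.2), which is only feasible for a handful of blocks. The instance and the function names are hypothetical.

```python
from itertools import combinations, product


def f(S, chosen):
    """RGB objective f(S, C') = |S| - |union of C'| + |C'|."""
    return len(S) - len(set().union(*chosen)) + len(chosen)


def solve_cover_ilp(C):
    """Minimize sum(x) subject to (4.2) by trying every 0/1 vector x."""
    universe = set().union(*C)
    best = None
    for x in product((0, 1), repeat=len(C)):
        feasible = all(
            sum(xi for xi, c in zip(x, C) if g in c) >= 1 for g in universe
        )
        if feasible and (best is None or sum(x) < sum(best)):
            best = x
    return [c for xi, c in zip(best, C) if xi]


def rgb_optimum(S, C):
    """Optimal RGB cost by exhaustive search over all sub-collections of C."""
    return min(
        f(S, chosen)
        for k in range(len(C) + 1)
        for chosen in combinations(C, k)
    )


# Hypothetical instance; gene g of the reference block occurs in no block.
S = set("abcdefg")
C = [frozenset("abc"), frozenset("cd"), frozenset("def"), frozenset("a")]
S_prime = set().union(*C)
ilp_cover = solve_cover_ilp(C)
lhs = rgb_optimum(S, C)
rhs = len(S - S_prime) + f(S_prime, ilp_cover)
print(lhs, rhs)
```

Both sides evaluate to the same cost on this instance: the one gene missing from every block contributes 1, and the two blocks of the minimum cover contribute 2.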

4.4 Results and Discussion

Here we evaluate the performance of the Greedy method in approximating the RGB problem. The performance is analyzed by comparing our algorithm with the previously best-known approach, the Heuristic method (Ream et al., 2015), and our exact ILP solution as described in Section 4.3.2. All of our studies are performed using the previously used benchmark data set (Ream et al., 2015), which we describe first. In our comparative studies, we first analyze the run-times (see Section 4.4.1), then the accuracy (see Section 4.4.2), and finally conclude the studies (see Section 4.4.3). We ran all our experiments on a Dell laptop with an Intel Core i7 8th Gen processor and 15 GB of DDR4 RAM, running Ubuntu 18.04.4 LTS. The code was written in Python 3.5.

We used the experimentally identified operons from the E. coli K-12 genome as the gold standard. We chose E. coli because of the high quality of its annotation and its large number of experimentally verified operons. The genomes in which we looked for orthoblocks were taken from (Fani et al., 2005). The Heuristic method has previously been used on this group of taxa (Ream et al., 2015), and the 55 operons were chosen based on the criterion that each operon comprises at least 5 genes.

4.4.1 Scalability Study

We compared the run-time of the Greedy method, the Heuristic method, and the ILP approach, using data sets ranging from input sizes of 5 to 13 genes per operon.

4.4.1.1 Experimental Settings.

We compare the running times of the three methods. In each analysis, we ran the three methods on the same data set (55 gene blocks) and recorded their running times in log10(seconds).

4.4.1.2 Results and Discussion.

[Figure 4.1: plot of running time per operon. Legend: number of genes in the operon, Naive, Approximated, ILP. x-axis: Operon Name. y-axis: Running time (log10).]

Figure 4.1: The x-axis shows the operon (gene block) names, sorted in ascending order of the number of genes in each block. From left to right: operons ast to ybg have 5 genes in the gene block, operons cai to ydh have 6 genes, operons cas to tdc have 7 genes, operons bam to ydg have 8 genes, operons atp to yje have 9 genes, operons paa to yrb have 11 genes, operon hyf has 12 genes, and operon nuo has 13 genes. The y-axis is the running time of each method in seconds on a log10 scale.

Overall, as shown in Figure 4.1, ILP takes the longest time to finish for all the operons except cai, mdt, rbs, and paa. The Greedy method is the fastest without exception, beating the Heuristic method by a large margin. On average, the running time of the Greedy method is 2.9 × 10^−7 seconds, that of the Heuristic method is 8.8 × 10^−5 seconds, and that of the ILP method is 10^−4 seconds. By these averages, the Greedy method is more than two orders of magnitude faster than both the Heuristic method and the ILP method. This is expected because the Heuristic method tries to generate almost all possible combinations of gene blocks, and the ILP method solves the problem exactly. To our surprise, the Heuristic method is actually the slowest in the 4 cases of cai, mdt, rbs, and paa. The reason is not that those operons have more genes, but that they have more potential orthologous gene blocks. Although the Heuristic method only takes milliseconds to finish in the worst case, we only surveyed 55 operons within a group of 33 taxa. Currently, NCBI has more than 26,000 bacterial genomes. The running time will become important when running on more complex operons and a larger number of species. In sum, the Greedy method performs the best in terms of run time.

4.4.2 Accuracy Study

We evaluate the methods in terms of the event-based cost function. That is, we say that the method that can reconcile the blocks with the reference operon using the smallest number of split and deletion events is more accurate. The reasoning is that a lower cost is the most parsimonious explanation for the evolutionary distance between any two orthoblocks as explained in (Nguyen et al., 2019). This is somewhat analogous to the process of pairwise protein sequence alignment by assigning a cost function based on the costs of indels and substitutions.

4.4.2.1 Experimental Settings

We generated the best orthologous gene block in each of the 33 species using the three methods. We then calculated the total number of deletion events and split events for each operon under each method, and present them in Figure 4.2a and Figure 4.2b. Then, for each method, we calculated the sum of deletion events and split events for each operon (Figure 4.2c), which represents our cost function. A method is considered better if its number of events is lower.
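The comparison described above can be sketched in a few lines. This is only an illustration of the cost function (deletions plus splits); the function names and the event counts in `example` are hypothetical and are not taken from the study's data.

```python
from typing import Dict, List, Tuple

# Hypothetical container: method name -> (deletion events, split events),
# each summed over all species for one operon.
EventCounts = Dict[str, Tuple[int, int]]


def event_cost(deletions: int, splits: int) -> int:
    """Event-based cost used in the accuracy study: deletions plus splits."""
    return deletions + splits


def best_methods(counts: EventCounts) -> List[str]:
    """Return the method(s) with the lowest cost for one operon (ties kept)."""
    costs = {method: event_cost(d, s) for method, (d, s) in counts.items()}
    lowest = min(costs.values())
    return sorted(method for method, cost in costs.items() if cost == lowest)


# Made-up counts for a single operon, for illustration only.
example: EventCounts = {"Greedy": (10, 4), "Heuristic": (12, 3), "ILP": (10, 3)}
winners = best_methods(example)
```

Keeping ties in the result mirrors the text, where several methods frequently reach the same number of events for an operon.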

4.4.2.2 Results and Discussion.

As shown in Figure 4.2a, in most cases, the three methods have the same number of deletion events. In 18 out of 55 cases, the Greedy method has fewer deletion events than the Heuristic method, and they have the same number of deletion events in the rest of the cases. In all cases, ILP has the lowest number of deletion events. As shown in Figure 4.2b, the three methods have the same number of split events in most cases. However, in 13 of 55 operons, the Heuristic method has fewer split events than the Greedy method. In 3 of 55 operons, the Greedy method has fewer split events than the Heuristic method. Figure 4.2c conveys the cost function (sum of deletion and split events) for each of the methods. In 15 of 55 operons, the Greedy method has a lower cost than the Heuristic method. The Greedy and Heuristic methods perform similarly on the other operons. In all cases, the ILP method has the lowest cost, as it is an exact method by design. After ILP, the Greedy method is the most accurate, outperforming the Heuristic method in 15 of 55 operons.

[Figure 4.2 image. Panel (a): deletion events per operon. Panel (b): split events per operon. Panel (c): split events + deletion events per operon. In all panels the x-axis is the operon name; the legend lists the number of genes in the operon and the three methods as Naive, Approximated, and ILP.]

Figure 4.2: The x-axis is the same as in Figure 4.1. The y-axis is the event count for the orthologous gene blocks in each species compared to the reference gene block. In Figures 4.2a, 4.2b, and 4.2c, a dot represents the deletion events, the split events, and the sum of deletion and split events, respectively. Red dots represent the Heuristic method, blue dots the Greedy method, and green dots the ILP method.

4.4.3 Concluding Discussion

From our scalability and accuracy study, we observe that the Greedy method is the fastest, being 100× faster than the Heuristic method and 1000× faster than the exact ILP method. In terms of accuracy, the Greedy method follows the ILP method very closely and either outperforms the Heuristic method or has the same results for all of the 55 operons. Therefore, on the small dataset that was used in (Ream et al., 2015), our Greedy method outperforms the Heuristic method in both scalability and accuracy.

4.5 Conclusion

Finding orthologous gene blocks is an essential step in understanding the evolution of gene blocks and complexity in bacterial genomes. In this study, we formally define the problem of identifying orthologous gene blocks given a reference operon, prove that it is NP-hard, and present an algorithm that guarantees an $O(\ln n)$ approximation and runs in polynomial time. In addition, we designed an ILP formulation and proved that it solves our problem exactly. In our experimental study, we demonstrated that the Greedy method performs better than the Heuristic method in terms of both accuracy and scalability. We note that the methods developed in this work cannot handle duplication events. While those are relatively rare, handling duplications is discussed in Chapter 7.

CHAPTER 5. TRACING THE ANCESTRY OF OPERONS IN BACTERIA

Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Ancestral reconstruction is essential for studying the evolutionary history of the biological characteristics of complex systems. The reconstruction can provide a peek into how complex systems evolve and why the systems develop in a specific way. In this chapter, we focus on the problem of reconstructing ancestral states of gene blocks in bacteria. Reconstructing plausible ancestral states of extant gene blocks and operons can help us understand how gene blocks evolve, identify possible functional intermediate states, and determine which forces might affect their evolution.

We use the definitions in Chapter 3 to define the Ancestral-Reconstruction-Gene-Block (ARGB) problems. Given a reference gene block and a phylogenetic tree in which each leaf node is populated with a homologous gene block, the problem seeks a gene block labeling for each of the inner nodes. We introduce two versions of the problem and present their respective algorithms. In the results section, we conduct several studies on the operons of E. coli and B. subtilis and form hypotheses on the evolution of these operons.

5.1 The Ancestral-Reconstruction-Gene-Block problem

Our problem is to find the labeling that minimizes the overall cost for the three evolutionary events. We now present the local ARGB problem and the global ARGB problem.

Problem 5. Ancestral-Reconstruction-Gene-Block (local)

Instance: $\langle T, G, \lambda \rangle$, which are the phylogenetic tree, the set of genes in the reference gene block, and the labeling of $L(T)$.

Find: $\hat{\lambda}$ so that for every node $u$ and its immediate children $u_1, u_2$, $c(u, u_1) + c(u, u_2)$ is minimum.

Problem 6. Ancestral-Reconstruction-Gene-Block (global)

Instance: $\langle T, G, \lambda \rangle$, which are the phylogenetic tree, the set of genes in the reference gene block, and the labeling of $L(T)$.

Find: $\hat{\lambda}$ so that $c(\hat{\lambda})$ is minimum.

5.2 Solving the ARGB problems

Here, we explore two related maximum parsimony heuristic approaches, local and global, to reconstruct the ancestral gene blocks. For each approach, we present an algorithm and an optimality analysis.

5.2.1 Local Algorithm

Briefly, the local approach focuses on finding the optimal parent ancestral gene block given its children gene blocks. For each internal node u, let u1 and u2 be its two children. The intuition is that we have to include a gene in the parent if both of the children have it. However, greedily propagating an included gene up the tree may predict its ancestral existence in deeper internal nodes than is warranted. To counter this problem, at each tree vertex v and for each gene g, we introduce a correction by checking the fraction of the leaf nodes under v that contain g. Since gene loss tends to happen more often than gene gain, we use a threshold of 0.5 to decide whether v contains gene g or not. To that end, we developed a greedy local optimization algorithm (Algorithm 3).

5.2.1.1 Algorithm

Algorithm 3: Local approach

Input: $T, G, \lambda$
Result: $\hat{\lambda}$

for each internal node $u$, traversing $T$ in post-order, do
    let $u_1, u_2$ be the children of $u$
    let $O_1, O_2$ be the orthoblocks assigned to $u_1, u_2$
    initial := $O_1 \cup O_2$
    remove from initial every gene $g$ with $\mathrm{FREQ}_g(u) \le 0.5$
    remove from initial every duplicated gene $g$ with $\mathrm{DUP}_g(u) \le 0.5$
    $\hat{\lambda}(u)$ := initial
return $\hat{\lambda}$
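The gene-content part of the local approach can be sketched as follows. This is a simplified illustration, not the ROAGUE implementation: it assumes each orthoblock is reduced to a plain set of gene names, takes FREQ_g(u) to be the fraction of leaves under u whose block contains g (as used in the optimality proof), and omits the duplication step (DUP_g) and split handling. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `genes` is given for leaves, computed for inner nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)
    genes: Optional[FrozenSet[str]] = None


def leaves(u: Node) -> List[Node]:
    """All leaf nodes of the subtree rooted at u."""
    if not u.children:
        return [u]
    return [leaf for child in u.children for leaf in leaves(child)]


def freq(u: Node, g: str) -> float:
    """Fraction of the leaves under u whose gene block contains g."""
    under = leaves(u)
    return sum(g in leaf.genes for leaf in under) / len(under)


def local_reconstruction(u: Node, threshold: float = 0.5) -> None:
    """Post-order pass: union the children's genes, then drop rare genes."""
    if not u.children:
        return
    for child in u.children:
        local_reconstruction(child, threshold)
    initial = frozenset().union(*(child.genes for child in u.children))
    # Keep a gene only if strictly more than `threshold` of the leaves have it.
    u.genes = frozenset(g for g in initial if freq(u, g) > threshold)


# Toy tree ((A, B), (C, D)) with made-up gene content.
A = Node("A", genes=frozenset({"a", "b"}))
B = Node("B", genes=frozenset({"a"}))
C = Node("C", genes=frozenset({"a", "c"}))
D = Node("D", genes=frozenset({"c"}))
X, Y = Node("X", [A, B]), Node("Y", [C, D])
root = Node("root", [X, Y])
local_reconstruction(root)
```

In the toy tree, gene `b` is dropped at `X` because only half of the leaves below it carry the gene, which is the correction against over-propagation described in Section 5.2.1.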

5.2.1.2 Optimality

Let $\hat{\lambda}$ be the result of our Algorithm 3 with input $(T, G, \lambda)$. For each $u \in I(T)$, let $u_1, u_2$ be its children and $O := \hat{\lambda}(u)$, $O_1 := \hat{\lambda}(u_1)$, $O_2 := \hat{\lambda}(u_2)$. We will show that $c_d(O, O_1) + c_d(O, O_2)$ and $c_u(O, O_1) + c_u(O, O_2)$ are locally minimal.

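Lemma 5. $\forall g \in G$, if $\mathrm{FREQ}_g(u) > .5$ then either $\mathrm{FREQ}_g(u_1) > .5$ or $\mathrm{FREQ}_g(u_2) > .5$.

Proof. Assume the contrary, that is, $\mathrm{FREQ}_g(u_1) \le .5$ and $\mathrm{FREQ}_g(u_2) \le .5$. Then

$$|\{v \in L(u_1) \mid g \in \mathrm{Gene}(\lambda(v))\}| \le \frac{|L(u_1)|}{2}, \qquad |\{v \in L(u_2) \mid g \in \mathrm{Gene}(\lambda(v))\}| \le \frac{|L(u_2)|}{2}.$$

Define $H := \{v \in (L(u_1) \cup L(u_2)) \mid g \in \mathrm{Gene}(\lambda(v))\}$. From the two inequalities above, we have:

$$|H| \le \frac{|L(u_1)|}{2} + \frac{|L(u_2)|}{2}.$$

Since $u_1, u_2$ are the children of $u$,

$$L(u_1) \cup L(u_2) = L(u), \qquad L(u_1) \cap L(u_2) = \emptyset,$$

and therefore

$$|\{v \in L(u) \mid g \in \mathrm{Gene}(\lambda(v))\}| \le \frac{|L(u)|}{2} \;\Rightarrow\; \mathrm{FREQ}_g(u) \le .5.$$

This contradicts $\mathrm{FREQ}_g(u) > .5$. Hence, either $\mathrm{FREQ}_g(u_1) > .5$ or $\mathrm{FREQ}_g(u_2) > .5$.

Lemma 6. $\forall g \in G$, if $\mathrm{FREQ}_g(u) \le .5$ then either $\mathrm{FREQ}_g(u_1) \le .5$ or $\mathrm{FREQ}_g(u_2) \le .5$.

Proof. The proof is the same as for Lemma 5.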

Lemma 7. $\forall g \in G$ and $\forall O' \in \Omega$, if $g \in \mathrm{Gene}(O)$ and $g \notin \mathrm{Gene}(O')$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| \le |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|.$$

Proof. Since $g \in \mathrm{Gene}(O)$, $\mathrm{FREQ}_g(u) > .5$. Therefore, $\mathrm{FREQ}_g(u_1) > .5$ or $\mathrm{FREQ}_g(u_2) > .5$ (by Lemma 5). Hence, $g \in \mathrm{Gene}(u_1)$ or $g \in \mathrm{Gene}(u_2)$. Consider 3 cases:

1. If $u_1$ and $u_2$ contain $g$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |1 - 1| + |1 - 1| = 0$$

$$|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)| = |0 - 1| + |0 - 1| = 2$$

Therefore, $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| < |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

2. If only $u_1$ contains $g$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |1 - 1| + |1 - 0| = 1$$

$$|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)| = |0 - 1| + |0 - 0| = 1$$

Therefore, $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

3. If only $u_2$ contains $g$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |1 - 0| + |1 - 1| = 1$$

$$|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)| = |0 - 0| + |0 - 1| = 1$$

Therefore, $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

From the above cases, we conclude that $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| \le |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

Lemma 8. $\forall g \in G$ and $\forall O' \in \Omega$, if $g \notin \mathrm{Gene}(O)$ and $g \in \mathrm{Gene}(O')$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| \le |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|.$$

Proof. Since $g \notin \mathrm{Gene}(O)$, $\mathrm{FREQ}_g(u) \le .5$. Therefore, $\mathrm{FREQ}_g(u_1) \le .5$ or $\mathrm{FREQ}_g(u_2) \le .5$ (by Lemma 6). Hence, $g \notin \mathrm{Gene}(u_1)$ or $g \notin \mathrm{Gene}(u_2)$. Consider 3 cases:

1. If $u_1$ and $u_2$ do not contain $g$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |0 - 0| + |0 - 0| = 0$$

$$|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)| = |1 - 0| + |1 - 0| = 2$$

Therefore, $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| < |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

2. If only $u_1$ does not contain $g$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |0 - 0| + |0 - 1| = 1$$

$$|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)| = |1 - 0| + |1 - 1| = 1$$

Therefore, $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

3. If only $u_2$ does not contain $g$, then

$$|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |0 - 1| + |0 - 0| = 1$$

$$|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)| = |1 - 1| + |1 - 0| = 1$$

Therefore, $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| = |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

From the above cases, we conclude that $|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| \le |I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|$.

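We can now state our theorems regarding the deletion costs and duplication costs.

Theorem 5. Let $\hat{\lambda}$ be the result of our Algorithm 3 with input $(T, G, \lambda)$. Let $I(T)$ be the set of inner nodes of tree $T$, and for any node $u$ let $T(u)$ and $L(u)$ be the subtree rooted at $u$ and the set of leaf nodes of that subtree, respectively. For each $u \in I(T)$, let $u_1, u_2$ be its children. We claim that $c_d(O, O_1) + c_d(O, O_2)$ is minimum.

Proof. Given $O' \in \Omega$, we will show that $c_d(O', O_1) + c_d(O', O_2) \ge c_d(O, O_1) + c_d(O, O_2)$. We have:

$$
\begin{aligned}
c_d(O', O_1) + c_d(O', O_2) &= \sum_{g} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right) \\
&= \sum_{g \in O'} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right) + \sum_{g \notin O'} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right) \\
&= \sum_{g \in O',\, g \in O} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right) + \sum_{g \in O',\, g \notin O} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right) \\
&\quad + \sum_{g \notin O',\, g \in O} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right) + \sum_{g \notin O',\, g \notin O} \left(|I_g(O') - I_g(O_1)| + |I_g(O') - I_g(O_2)|\right)
\end{aligned}
\tag{5.1}
$$

On the other hand, we have:

$$
\begin{aligned}
c_d(O, O_1) + c_d(O, O_2) &= \sum_{g} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right) \\
&= \sum_{g \in O} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right) + \sum_{g \notin O} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right) \\
&= \sum_{g \in O',\, g \in O} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right) + \sum_{g \in O',\, g \notin O} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right) \\
&\quad + \sum_{g \notin O',\, g \in O} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right) + \sum_{g \notin O',\, g \notin O} \left(|I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)|\right)
\end{aligned}
\tag{5.2}
$$

The first and last sums in (5.1) and (5.2) are equal term by term, because $I_g(O) = I_g(O')$ for those genes. The second and third sums are compared term by term using Lemma 8 and Lemma 7, respectively. Hence, from Lemma 7 and Lemma 8, we have:

$$c_d(O, O_1) + c_d(O, O_2) \le c_d(O', O_1) + c_d(O', O_2)$$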
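The case analysis behind Lemmas 7 and 8 is small enough to check by enumeration. The sketch below assumes, as in the lemmas, that $I_g$ is a 0/1 presence indicator; the function names are illustrative and not part of the original work.

```python
from itertools import product


def gene_cost(parent: int, child1: int, child2: int) -> int:
    """Per-gene cost |I_g(O) - I_g(O_1)| + |I_g(O) - I_g(O_2)| for 0/1 indicators."""
    return abs(parent - child1) + abs(parent - child2)


def check_case_analysis() -> bool:
    """Enumerate all child states and verify the inequalities of Lemmas 7 and 8."""
    for i1, i2 in product((0, 1), repeat=2):
        # Lemma 7 setting: g is in O, so at least one child contains g.
        if (i1 or i2) and gene_cost(1, i1, i2) > gene_cost(0, i1, i2):
            return False
        # Lemma 8 setting: g is not in O, so at least one child lacks g.
        if not (i1 and i2) and gene_cost(0, i1, i2) > gene_cost(1, i1, i2):
            return False
    return True


all_cases_hold = check_case_analysis()
```

When the two children disagree, both choices for the parent cost exactly 1, which is why the lemmas state a non-strict inequality.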

Theorem 6. Let $\hat{\lambda}$ be the result of our Algorithm 3 with input $(T, G, \lambda)$. Let $I(T)$ be the set of inner nodes of tree $T$, and for any node $u$ let $T(u)$ and $L(u)$ be the subtree rooted at $u$ and the set of leaf nodes of that subtree, respectively. For each $u \in I(T)$, let $u_1, u_2$ be its children. We claim that $c_u(O, O_1) + c_u(O, O_2)$ is minimum.

Proof. The proof follows the same logic as for the deletion costs, except that we consider $\mathrm{DUP}_g(u)$ instead of $\mathrm{FREQ}_g(u)$.

5.2.2 Global Algorithm

Here we try to achieve minimal deletion and duplication costs globally. Intuitively, for each node v and each gene g in the reference operon, we decide whether gene g could appear in the orthoblock that we will assign to v. To do this, we use dynamic programming. By traversing the phylogenetic tree bottom-up and then top-down, we determine the occurrence of each gene in the reference operon for each node v. We determine whether a gene should be duplicated in the same manner. For split costs, we generate the relevant gene blocks of the two children given the set of genes to be included.

5.2.2.1 Algorithm

Algorithm 4: Global approach

Input: $T, G, \lambda$
Result: $\hat{\lambda}$

for each gene $g$ in $G$ do
    determine whether to include gene $g$ in the gene set of each internal node $u$ using the Fitch algorithm
    determine whether to include gene $g$ in the duplicated gene set of each internal node $u$ using the Fitch algorithm
for each internal node $u$, traversing $T$ in post-order, do
    let $u_1, u_2$ be the children of $u$
    let $O_1 := \lambda(u_1)$, $O_2 := \lambda(u_2)$
    if $|\mathrm{Rel}(O_1, \mathrm{Gene}(u))| < |\mathrm{Rel}(O_2, \mathrm{Gene}(u))|$ then
        initial := $\mathrm{Rel}(O_1, \mathrm{Gene}(u))$
    else
        initial := $\mathrm{Rel}(O_2, \mathrm{Gene}(u))$
    $\hat{\lambda}(u)$ := initial
return $\hat{\lambda}$
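The first loop of Algorithm 4 relies on the Fitch algorithm. The following is a minimal sketch of the classical two-pass Fitch procedure for a single gene's presence (1) or absence (0) on a binary tree. The tree encoding (nested tuples with string leaf names), the function names, and the tie-breaking rule at ambiguous nodes are assumptions made for illustration, not details taken from ROAGUE.

```python
from typing import Dict, Union

Tree = Union[str, tuple]  # a leaf name, or a pair (left_subtree, right_subtree)


def fitch(tree: Tree, leaf_state: Dict[str, int], tie_break: int = 1) -> Dict[Tree, int]:
    """Assign 0/1 to every node so that the number of state changes is minimal."""
    sets: Dict[Tree, set] = {}

    def up(node: Tree) -> None:
        # Bottom-up pass: intersection of the children's sets if non-empty, else union.
        if isinstance(node, str):
            sets[node] = {leaf_state[node]}
            return
        left, right = node
        up(left)
        up(right)
        common = sets[left] & sets[right]
        sets[node] = common if common else sets[left] | sets[right]

    state: Dict[Tree, int] = {}

    def down(node: Tree, parent_state) -> None:
        # Top-down pass: follow the parent when allowed, otherwise use the tie-break.
        options = sets[node]
        if parent_state in options:
            state[node] = parent_state
        elif tie_break in options:
            state[node] = tie_break
        else:
            state[node] = next(iter(options))
        if not isinstance(node, str):
            for child in node:
                down(child, state[node])

    up(tree)
    down(tree, None)
    return state


def changes(tree: Tree, state: Dict[Tree, int]) -> int:
    """Number of edges whose endpoints carry different states."""
    if isinstance(tree, str):
        return 0
    return sum((state[tree] != state[child]) + changes(child, state) for child in tree)


# Toy example: the gene is present in A, B, C and absent in D.
toy_tree = (("A", "B"), ("C", "D"))
toy_state = fitch(toy_tree, {"A": 1, "B": 1, "C": 1, "D": 0})
toy_changes = changes(toy_tree, toy_state)
```

Running the procedure once per gene in the reference operon gives the gene set of every internal node, which is the per-gene independence that the optimality argument in Section 5.2.2.2 depends on.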

5.2.2.2 Optimality

Let $\hat{\lambda}$ be the result of our Algorithm 4 with input $(T, G, \lambda)$. We will show that $\sum_{(v_1, v_2) \in E(T_u)} c_d(\hat{\lambda}(v_1), \hat{\lambda}(v_2))$ and $\sum_{(v_1, v_2) \in E(T_u)} c_u(\hat{\lambda}(v_1), \hat{\lambda}(v_2))$ are the global minimum for subtree $T_u$.

Lemma 9. $\forall g \in G$, $\forall u \in I(T)$, $\sum_{(v_1, v_2) \in E(T_u)} |I_g(\hat{\lambda}(v_1)) - I_g(\hat{\lambda}(v_2))|$ is minimum.

Proof. Any node $u \in I(T)$ can only be one of 3 types: a node whose children are both leaf nodes, a node with one leaf child and one inner-node child, or a node whose children are both inner nodes. We denote them as type 1, type 2, and type 3, respectively. We can then prove the lemma by induction as we traverse the tree $T$ in post-order. The proof is similar to the proof of the Fitch algorithm, since each node can only take a value of 0 or 1.

Theorem 7. Let $\hat{\lambda}$ be the result of our Algorithm 4 with input $(T, G, \lambda)$. Let $I(T)$ be the set of inner nodes of tree $T$, and for any node $u$ let $T(u)$ and $L(u)$ be the subtree rooted at $u$ and the set of leaf nodes of that subtree, respectively. We will show that $\sum_{(v_1, v_2) \in E(T_u)} c_d(\hat{\lambda}(v_1), \hat{\lambda}(v_2))$ is the global minimum for subtree $T_u$.

Proof. We can rewrite the objective function as:

$$\sum_{(v_1, v_2) \in E(T_u)} c_d(\hat{\lambda}(v_1), \hat{\lambda}(v_2)) = \sum_{g \in G} \; \sum_{(v_1, v_2) \in E(T_u)} |I_g(\hat{\lambda}(v_1)) - I_g(\hat{\lambda}(v_2))|$$

Because the genes $g \in G$ are independent of each other, and each inner sum is minimum by Lemma 9, the sum over all genes is also minimum.

Since our theorem holds for every inner node, it also holds for the root. Therefore, the global approach provides optimal deletion costs. Similarly, we can prove the same for duplication costs.
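The rewriting of the objective used in the proof of Theorem 7 can be illustrated numerically. The sketch assumes that gene blocks are reduced to sets of gene names and that $c_d$ between two blocks is the sum over genes of $|I_g - I_g|$ with 0/1 indicators, which equals the size of the symmetric difference of the two sets. The edge list and labels are made up for the example.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]  # (parent name, child name)


def deletion_cost(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Sum over genes of |I_g(a) - I_g(b)|, i.e. the size of the symmetric difference."""
    return len(a ^ b)


def cost_by_edges(edges: List[Edge], label: Dict[str, FrozenSet[str]]) -> int:
    """Left-hand side: sum the block-level deletion cost over all tree edges."""
    return sum(deletion_cost(label[p], label[c]) for p, c in edges)


def cost_by_genes(edges: List[Edge], label: Dict[str, FrozenSet[str]],
                  genes: Iterable[str]) -> int:
    """Right-hand side: for each gene, count the edges where its indicator changes."""
    return sum(
        sum(abs(int(g in label[p]) - int(g in label[c])) for p, c in edges)
        for g in genes
    )


# Hypothetical labeled tree, written as an edge list.
toy_edges: List[Edge] = [("root", "x"), ("root", "y"), ("x", "A"), ("x", "B")]
toy_label = {
    "root": frozenset({"a", "b"}),
    "x": frozenset({"a", "b"}),
    "y": frozenset({"a"}),
    "A": frozenset({"a", "b", "c"}),
    "B": frozenset({"a"}),
}
lhs = cost_by_edges(toy_edges, toy_label)
rhs = cost_by_genes(toy_edges, toy_label, ["a", "b", "c"])
```

Because the two totals agree for any labeling, minimizing each gene's edge count separately is enough to minimize the total deletion cost.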

5.3 Results and Discussion

We used experimentally-identified operons from E. coli K-12 and B. subtilis str. 168 genomes as gold standards for deriving operons from Gram-negative and Gram-positive bacteria, respectively. The reason we chose these two species is that they both have well-annotated genomes, including experimentally verified and functionally annotated operons.

5.3.1 Operons using Escherichia coli as a reference taxon

We chose E. coli as the reference species for proteobacteria, a major group of Gram-negative bacteria. Our selection resulted in a set of proteobacteria species comprising three ε-proteobacteria, six α-proteobacteria, seven β-proteobacteria, and 17 γ-proteobacteria, including E. coli. These taxa include two γ-proteobacteria insect endosymbionts: Buchnera aphidicola and Candidatus Blochmannia. These two species have unusually small genomes due to their endosymbiotic nature and display massive gene loss. We reconstructed ancestors for the following operons from E. coli (described below): atpIBEFHAGDC and paaABCDEFGHIJK.

atpIBEFHAGDC. The atpIBEFHAGDC operon codes for F1Fo-ATPase, which catalyzes the synthesis of ATP from ADP and inorganic phosphate (Kasimoglu et al., 1996). ATP synthase is composed of two fractions: F1 and Fo (Senior, 1990). The F1 fraction contains the catalytic sites, and its proteins are coded by five genes (atpA, atpC, atpD, atpG, atpH) (Senior, 1990). The Fo complex constitutes the proton channel, and its proteins are coded by three genes: atpF, atpE, and atpB. atpI is a non-essential regulatory gene. Figure S3 shows the high degree of conservation of this operon.

Figure 5.1 shows the ancestral reconstruction using the global maximum parsimony algorithm. Both the local (Fig S1) and global reconstructions consistently place orthoblocks atpACDGH and atpBF in the most common ancestors for different Gram-negative bacteria. This finding agrees with the long-standing hypothesis that the Fo and the F1 fractions have evolved separately, with the two fractions having homologs in the hexameric DNA helicases and in flagellar motor complexes. Although we find the gene atpI in several species, the reconstruction predicts that atpI is not in the same cluster with other genes. As stated, atpI is probably not an essential component of the F1Fo ATPase (NJ, 1984). Another interesting finding is the duplication of atpF in ε-proteobacteria, which appears to predate their common ancestor. Note that all genes exist as a gene block, even in the endosymbionts Blochmannia and B. aphidicola.

[Figure 5.1 image. Clade labels: ε-proteobacteria, α-proteobacteria, β-proteobacteria, γ-proteobacteria.]

Figure 5.1: Ancestral reconstruction of operon atpIBEFHAGDC using the global optimization approach. The lower-case letters in each tree node represent the genes in the orthoblock (e.g. “a” represents “atpA”, see legend in blue bar, top). A ‘|’ designates a split (i.e. a distance ≥500bp between the genes to either side of the ‘|’). The green bar on top shows the total number of events that took place in this reconstruction. The numbers in the brackets in the inner nodes are a 3-tuple showing the cumulative count of events going from the leaf nodes to the tree root in the following order: [deletions, duplications, splits]. No orthologous gene blocks were found in species labeled with an asterisk (∗). The reference genome E. coli is marked with a box. These naming and color conventions persist through this study.

The ε-, α-, β-, and γ-proteobacteria species all have a conserved intact F1 complex (coded by the atpACDGH cluster), which predates their common ancestor. The genes of the Fo complex in ε-proteobacteria (atpB, atpE, atpF) are not in the same cluster as the genes making up F1. Furthermore, it is unclear whether the gene split that is only found in ε-proteobacteria predates the least common ancestor with the other proteobacteria clades, or whether it was introduced within the ε-proteobacteria. From the reconstructions provided, the scenario appears to be the latter. However, this observation may also be a result of the small number of species studied here. The species in the ε- and α-proteobacteria display a known duplication of gene atpF. atpF′ appears as a sister group to atpF (Koumandou and Kossida, 2014).

paaABCDEFGHIJK. The operon paaABCDEFGHIJK codes for genes involved in the catabolism of phenylacetate (Martin and McInerney, 2009). The ability to catabolize phenylacetate varies greatly between proteobacterial species, and even among different E. coli K-12 strains. In contrast with the atpABCDEFG operon, which is conserved through many species, the operon paaABCDEFGHIJK is only found in full complement as an operon in some E. coli K-12 strains and some Pseudomonas putida strains. While obviously less conserved than the atpABCDEFG operon, certain orthoblocks appear to be conserved, providing possible partial functionality. The orthoblock paaABCDE is found in three Bordetella species and also in Bradyrhizobium diazoefficiens. The products of paaA, paaB, paaC, and paaE make up the subunits of the 1,2-phenylacetyl-CoA epoxidase, and paaD is hypothesized to form an iron-sulfur cluster with the product of paaE (Grishin et al., 2011). We did not find orthologs in the endosymbionts B. aphidicola and Blochmannia.

In both the local and global reconstructions (Fig S2 and Figure 5.2, respectively), only the ancestor of the Bordetella species has a combination of the paaABC complex with paaE. It appears that this combination has full activity (Grishin et al., 2011). In addition, the global approach predicts gene blocks just for the ancestors of α- and most of γ-proteobacteria. Only the common ancestor of the Bordetella genus contains the cluster paaABCE. It has been confirmed that this cluster of genes is identical to those of E. coli (Luengo et al., 2001). In both approaches, genes paaF and paaG are not found to be in the same gene blocks. Hence, the ancestors are most likely missing the hydratase-isomerase complex. The paaJ thiolase catalyzes two steps in the phenylacetate catabolism (Ismail et al., 2003; Teufel et al., 2010; Nogales et al., 2007). In addition, paaH is the NAD+-dependent 3-hydroxyadipyl-CoA dehydrogenase involved in phenylacetate catabolism (Ismail et al., 2003). Therefore, it makes sense that paaJ and paaH appear in most of the ancestral nodes that have gene blocks.

[Figure 5.2 image. Clade labels: ε-proteobacteria, α-proteobacteria, β-proteobacteria, γ-proteobacteria.]

Figure 5.2: Ancestral gene block reconstruction of operon paaABCDEFGHIJK using the global reconstruction approach.

It is interesting to note that we see the formation of intermediate functional forms both in a highly conserved gene block atpIBEFHAGDC and the less conserved gene block based on the operon paaABCDEFGHIJK. Also, in both cases, the global approach performs better in terms of minimizing events. For brevity, we only provide the global ancestral reconstruction henceforth.

5.3.2 Operons using Bacillus subtilis as a reference taxon

B. subtilis is a Gram-positive, spore-forming bacterium commonly found in soil, and is also a normal gut commensal in humans. It is a model organism for Gram-positive, spore-forming bacteria, and as such its genome of about 4,450 genes is well annotated. Here we used ROAGUE to reconstruct the ancestors of two B. subtilis operons across 33 species. We selected species from the order Bacillales using PDA. Species from the following families were selected: Bacillaceae (including the reference organism B. subtilis), Staphylococcaceae (Macrococcus and Staphylococcus), Alicyclobacillaceae, Listeriaceae, and Planococcaceae.

lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO. Gene block lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO facilitates the heat shock response in B. subtilis, and the gene block hrcA-grpE-dnaK-dnaJ was the first identified heat shock operon within Bacillus spp. (Wetzstein et al., 1992). The four genes hrcA, grpE, dnaK, dnaJ (e, c, b, a in Figure 5.3) form a tetracistronic structure, which is essential to the heat shock response role (Homuth et al., 1997). The four genes are proximal (they never separated in the course of evolution) in all the species examined, and form the core of the orthoblock. Overall, this operon is quite conserved, and the ancestral reconstructions are highly similar to the reference operon.

[Figure 5.3 image. Clade labels: Macrococcus, Paenibacillaceae, Staphylococcus, Alicyclobacillaceae, Bacillaceae, Bacillales Family XII, Listeriaceae, Planococcaceae.]

Figure 5.3: Ancestral reconstruction of lepA-hemN-hrcA-grpE-dnaK-dnaJ-prmA-yqeU-rimO.

[Figure 5.4 image. Clade labels: Macrococcus, Paenibacillaceae, Staphylococcus, Alicyclobacillaceae, Bacillaceae, Bacillales Family XII, Listeriaceae, Planococcaceae.]

Figure 5.4: Ancestral reconstruction of mmgABCDE-prpB.

mmgABCDE-prpB. The operon mmgABCDE-prpB is expressed during endosporulation (Acharya, 2009). Subunit mmgABC’s breakdown of fatty acids is a means for attaining energy to drive the cell’s preparation for dormancy (Quattlebaum, 2009). Hence, it is reasonable to see that the common ancestor has this subunit. In addition, gene mmgD and gene prpB/yqiQ are predicted to be proximal. Several studies predicted that genes mmgD, prpB, and prpD encode the proteins of the putative methylcitrate shunt (Voigt et al., 2007). However, they did not specify if deletion mutations might contribute to a defect of the functionality. See Figure 5.4.

5.4 Conclusions

Operons offer a tractable model for the evolution of complexity. Understanding how simple units of genes may converge into an operon can lead us to a better understanding of how complex molecular systems evolve. Here we developed a method to reconstruct ancestral gene blocks using maximum parsimony. Using this method, we provide several examples of ancestral gene block reconstructions based on reference operons in E. coli and B. subtilis. Some interesting observations emerge regarding conservation and the ancestry of operons. From our examples, it appears that essentiality (the trait of being essential to life) and the formation of a protein complex are two drivers for gene block conservation. This is most apparent in the atpABCDEFG operon coding for F1Fo-ATPase in proteobacteria. There are few evolutionary events identified in the atpABCDEFG operon ancestry. In the Supplementary Materials, we provide a brief study of two more gene blocks. We observed that the ribose transporter block also seems to preserve the core ribose transporter (rbsABC), but not the ribose phosphorylation genes rbsD and rbsK. ROAGUE also highlights intermediate functional forms of the orthoblocks, as we see in the pattern of conservation in paaABCDEFGHIJK.

It did not escape our notice that our model does not account for horizontal gene transfer, which has been shown to be a driver of operon dispersal in some cases (Omelchenko et al., 2003; Koonin, 2009). However, our model does set the stage for a new method for doing so. Typically, detecting horizontal gene transfer is done by looking for the conservation of genes and gene structures between distant OTUs, and for anomalous codon usage (Koonin et al., 2001). Our method opens up a new way of detecting HGT, by reconciling a species tree with an operon tree, in the same way that phylogenomic analyses do for gene trees and species trees (Eisen, 1998), which would be an exciting future development of this study. In addition, we ignore the gene order in the gene block. While the relationship between gene organization and expression in operons is not well understood, it is clear from several studies that gene order has an effect in some cases on the expression and functionality of the operon in general (e.g. Hiroe et al., 2012; Wells et al., 2016; Lim et al., 2011). Adding the parameters of horizontal gene transfer, gene order preservation, or both to ROAGUE would be highly valuable. We discuss potential extensions of our research in Chapter 7.

CHAPTER 6. UNRAVELING A TANGLED SKEIN: EVOLUTIONARY ANALYSIS OF THE BACTERIAL GIBBERELLIN BIOSYNTHETIC OPERON

Gibberellin (GA) phytohormones are ubiquitous regulators of growth and developmental processes in vascular plants. The convergent evolution of GA production by plant-associated bacteria, including both symbiotic, nitrogen-fixing rhizobia, and phytopathogens, suggests that manipulation of GA signaling is a powerful mechanism for microbes to gain advantage in these interactions. Although homologous operons encode GA biosynthetic enzymes in both rhizobia and phytopathogens, notable genetic heterogeneity and scattered operon distribution in these lineages suggest distinct functions for GA in varied plant-microbe interactions. Thus, deciphering GA operon evolutionary history could provide crucial evidence for understanding the distinct biological roles for bacterial GA production. To further establish the genetic composition of the GA operon, two operon-associated genes that exhibit limited distribution among rhizobia were biochemically characterized, verifying their roles in GA biosynthesis. In this chapter, we employ the previous maximum-parsimony ancestral gene block reconstruction algorithm to characterize loss, gain, and horizontal gene transfer (HGT) of GA operon genes within alphaproteobacteria rhizobia, which exhibit the most heterogeneity among GA operon-containing bacteria. Collectively, this evolutionary analysis reveals a complex history for HGT of both individual genes and the entire GA operon, and ultimately provides a basis for linking genetic content to bacterial GA functions in diverse plant-microbe interactions.

Within this chapter, the abbreviated term “α-rhizobia” is used to refer to symbiotic, nitrogen-fixing rhizobia that belong to the alphaproteobacteria class. Likewise, the term “β-rhizobia” refers to symbiotic, nitrogen-fixing rhizobia from the betaproteobacteria class. For the α-rhizobia, the following genera were assessed because they have been found to contain the GA operon: Azorhizobium, Bradyrhizobium, Ensifer/, Mesorhizobium, Microvirga, and Rhizobium (Nagel and Peters, 2017). The β-rhizobia previously found to have the GA operon fall within the Burkholderia and Paraburkholderia genera (Nagel et al., 2018). The different GA biosynthesis operons, or GA operons, within the α-rhizobia can be found in Figure 6.1.

Figure 6.1: Diversity among GA biosynthetic operons in divergent bacterial lineages. The core operon genes are defined as cyp112, cyp114, fd, sdr, cyp117, ggps, cps, and ks, as these are almost always present within the GA operon. Other genes, including cyp115, idi, and ids2, exhibit a more limited distribution among GA operon-containing species. The tandem diagonal lines in the Mesorhizobium loti MAFF303099 operon indicate that cyp115 is not located adjacent to the rest of the operon.

6.1 Gene clusters analysis

While the GA operons found in gammaproteobacteria exhibit essentially uniform gene content and structural composition, those from the α-rhizobia exhibit much more diversity in genetic structure. This suggests that selective pressures specific to the rhizobia, presumably their symbiotic relationship with legumes, may have driven this heterogeneity in the operon. To better understand the evolutionary history of the GA operon in the α-rhizobia, a more thorough analysis was carried out with ROAGUE (Chapter 5). ROAGUE generates a phylogenetic tree with selected taxa that contain gene blocks (i.e. gene clusters) of interest, and then uses a maximum parsimony approach to reconstruct a predicted gene block structure at each ancestral node of the tree. Using the ROAGUE method, the evolutionary events involved in the genetic construction of orthologous GA operon gene blocks in the α-rhizobia, specifically gene loss, gain, and duplication, were quantitatively assessed (see Figure 6.2 for a summary of the method pipeline). A total of 118 α-rhizobia with the GA operon were included in this analysis (see Table 6.2 for a list of the strains). The most phylogenetically distant GA operon to those in the α-rhizobia (lowest sequence similarity of the housekeeping gene rpoB) is found within E. tracheiphila (Nagel and Peters, 2017), and as such, this was used as an outgroup. Additionally, to observe the relative relationship between alpha- and gammaproteobacterial operons, the GA operon from Xanthomonas oryzae pv. oryzicola BLS256 (Xoc) was also included in the analysis.

Figure 6.2: Pipeline of the method to reconstruct the ancestral state of the gibberellin operon. [Flowchart. Input: the genes of the gibberellin operon from Xanthomonas oryzae and the list of species. Steps: retrieve annotated assembly GenBank files from NCBI; use BLAST to find homologous genes and gene blocks homologous to the gibberellin operon; build a species tree (120 species) and, from concatenated multiple alignments, an operon tree (120 species); reduce each tree using PDA (64 species); reconstruct ancestral gene blocks using ROAGUE. Output: ancestral reconstruction of the gibberellin operon using the species tree and using the operon tree.]
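To make the idea of a maximum parsimony reconstruction at ancestral nodes concrete, the following is a minimal sketch of the classical Fitch small-parsimony pass for a single presence/absence character (for example, whether one gene is present in a block). It is not ROAGUE's procedure, which works on whole gene blocks and counts deletions, duplications, and splits (Chapter 5). The taxon names and states are hypothetical.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree as nested tuples; leaves are taxon names (hypothetical).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_root_states(tree: Tree, leaf_states: Dict[str, int]) -> Tuple[Set[int], int]:
    """Bottom-up Fitch pass: return the candidate root states and the
    minimum number of state changes needed on the tree."""
    changes = 0

    def visit(node: Tree) -> Set[int]:
        nonlocal changes
        if isinstance(node, str):
            return {leaf_states[node]}
        left, right = node
        left_set, right_set = visit(left), visit(right)
        shared = left_set & right_set
        if shared:
            # Children agree on at least one state: no change needed here.
            return shared
        # Children disagree: one change is required below this node.
        changes += 1
        return left_set | right_set

    return visit(tree), changes


# Toy example (assumed data): 1 = gene present, 0 = gene absent.
toy_tree = (("taxon_A", "taxon_B"), ("taxon_C", "taxon_D"))
toy_presence = {"taxon_A": 1, "taxon_B": 1, "taxon_C": 0, "taxon_D": 1}
root_states, n_changes = fitch_root_states(toy_tree, toy_presence)
print(root_states, n_changes)
```

In this toy case the most parsimonious explanation is that the gene was present at the root and lost once, in taxon_C.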

An initial reconstruction was made by creating a species tree using the amino acid sequence of rpoB (RNA polymerase β subunit) from each strain as the phylogenetic marker gene (“full species tree” or FS) (Figure 6.6). However, based upon the predicted HGT of the operon, it was reasoned that a species tree may not accurately depict GA operon evolution, as operon genes from distantly related bacteria may be more similar than the relevant rpoB genes if the operon was acquired via HGT. Therefore, a second tree was created based on a concatenation of protein sequences comprising the core GA operon (“full operon tree” or FO) (Figure 6.7). Due to the large number of species being analyzed, along with apparent phylogenetic redundancy that could introduce bias, reconstructions were also made with only the more distinct representative strains by using the Phylogenetic Diversity Analyzer (PDA) program, which reduced the number of analyzed taxa to 64 (Chernomor et al., 2015). These reduced phylogenetic trees are referred to as the “partial species tree” or PS (Figure 6.3), and the “partial operon tree” or PO (Figure 6.4).

In Figure 6.3 and Figure 6.4, the lower-case letters in each tree node represent the genes in the orthoblock (e.g. “a” represents cyp115), with each gene additionally indicated by a unique color (see the legend at the top of each figure). A blank space between genes designates a split of ≥500 bp between the genes on either side of the blank space. The green bar at the top left of the figure displays the total number of events that occur in the reconstruction. For each inner node u, the floating-point number (e.g. 98.0) represents the bootstrap value at that node. The numbers in brackets indicate the cumulative count of events going from the leaf nodes to node u in the following order: [deletions, duplications, splits]. Each leaf node is accompanied by symbols (*, ?, !), the genomic accession number, the species/strain name, and the gene block for that strain. An asterisk (*) indicates that the gene block contains full-length cyp115 (gene “a”); an exclamation mark (!) indicates that the gene block contains a truncation/fragment of cyp115; and a question mark (?) indicates that the gene block contains ggps2 (gene “k”). The reference strain, Xanthomonas oryzae pv. oryzicola BLS256, is in blue, and the outgroup strain, Erwinia tracheiphila PSU-1, is in gray. These naming and color conventions persist throughout this study.
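The split convention described above (a blank space wherever neighboring genes are at least 500 bp apart) can be illustrated with a short sketch. The coordinates below are invented for illustration, and the function name and tuple layout are assumptions rather than part of ROAGUE.

```python
from typing import List, Tuple

# (gene letter, start, end) in base pairs; coordinates are hypothetical.
Gene = Tuple[str, int, int]


def split_into_blocks(genes: List[Gene], min_split_gap: int = 500) -> List[str]:
    """Group genes into sub-blocks, starting a new sub-block whenever the
    intergenic gap to the previous gene is at least `min_split_gap` bp."""
    ordered = sorted(genes, key=lambda g: g[1])
    blocks: List[List[str]] = []
    current: List[str] = []
    prev_end = None
    for letter, start, end in ordered:
        if prev_end is not None and start - prev_end >= min_split_gap:
            blocks.append(current)
            current = []
        current.append(letter)
        # Track the furthest end seen so overlapping genes do not create gaps.
        prev_end = end if prev_end is None else max(prev_end, end)
    if current:
        blocks.append(current)
    return [" ".join(block) for block in blocks]


# Toy locus: b, c, d are adjacent; a lies 2,100 bp downstream of d.
toy_genes = [("b", 100, 1300), ("c", 1350, 2600), ("d", 2650, 2900), ("a", 5000, 6200)]
print(split_into_blocks(toy_genes))  # ['b c d', 'a']
```

Each returned string corresponds to one run of letters in a tree node, and the boundary between strings corresponds to one blank space (one split) in the figures.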

The ability of different ancestral reconstructions to capture the likely evolution of a gene cluster can be assessed by the number of events (loss, gain, and duplication) calculated by this method, with a lower number of events indicating a more parsimonious reconstruction. From this analysis, it was found that fewer evolutionary events are reconstructed in FO (75 events) than in FS (121 events) (Figures 6.6 and 6.7), with the same relative trend observed with the partial trees (62 events for PO vs. 78 events for PS) (Figures 6.3 and 6.4). The greater parsimony observed in reconstructions built with alignments of the concatenated GA operon strongly supports the previously suggested hypothesis of HGT among α-rhizobia (Hershey et al., 2014). Accordingly, the reconstructions based on GA operon similarity (i.e. FO and PO) were used for further analyses of operon inheritance.

[Figure 6.3: tree image not reproduced. Summary panel: deletion count 60, duplication count 0, split count 18. Legend: a: cyp115, b: cyp112, c: cyp114, d: fd, e: sdr, f: cyp117, g: ggps, h: cps, i: ks, j: idi, k: ggps2, p: cyp115 pseudogene/fragment.]

Figure 6.3: Reduced ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method as a measure of the distance between species. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

[Figure 6.4: tree image not reproduced. Summary panel: deletion count 48, duplication count 0, split count 14. Legend as in Figure 6.3.]

Figure 6.4: Reduced ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two gammaproteobacteria using the Neighbor-Joining method. The number of analyzed species was reduced to 64 with the Phylogenetic Diversity Analyzer software, and ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.
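The event totals for the partial trees can be recovered from the per-category counts shown in the summary panels of Figures 6.3 and 6.4 (deletions, duplications, splits). The short sketch below does that arithmetic and picks the reconstruction with the fewest events. The dictionary layout and function name are illustrative choices, not part of ROAGUE.

```python
from typing import Dict, Tuple

# [deletions, duplications, splits] as reported in Figures 6.3 (PS) and 6.4 (PO).
partial_tree_events: Dict[str, Tuple[int, int, int]] = {
    "PS": (60, 0, 18),
    "PO": (48, 0, 14),
}


def total_events(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, int]:
    """Sum the three event categories for each reconstruction."""
    return {name: sum(triple) for name, triple in counts.items()}


totals = total_events(partial_tree_events)
most_parsimonious = min(totals, key=totals.get)
print(totals)             # {'PS': 78, 'PO': 62}
print(most_parsimonious)  # PO
```

The sums reproduce the 78 (PS) and 62 (PO) events quoted in the text, so the operon-based tree requires 16 fewer events than the species-based tree on the same 64 taxa.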

In contrast to the phytopathogens, a full-length cyp115 gene is absent from the genomes of most rhizobia (including both α- and β-rhizobia). Instead, α- and β-rhizobia typically have only the core operon and, hence, can only produce the penultimate intermediate GA9 rather than bioactive GA4 (Méndez et al., 2014; Nagel and Peters, 2017; Nett et al., 2017a). ROAGUE analysis indicates that cyp115 loss occurred soon after α-rhizobia acquisition of the GA operon, as the reconstructed ancestral node that connects the α-rhizobia to Xoc (and the rest of the gammaproteobacteria) does not contain cyp115 (Figure 6.4). The α-rhizobia presumably acquired their GA operon from a gammaproteobacterial ancestor, and the gammaproteobacteria seem to always have cyp115 at the 5’ end of the operon. In contrast, the α-rhizobia typically have only a partial cyp115 pseudogene/fragment located at this position, as previously described (Tully et al., 1998; Nett et al., 2017a). This suggests that the original operon acquired by an α-rhizobial ancestor contained cyp115, and that this gene was subsequently lost. Interestingly, it has been previously reported that cyp115 is also absent from the GA operons of β-rhizobia, which seem to have independently gained their operon from a gammaproteobacterial progenitor (Nagel et al., 2018).

Although cyp115 is absent in most α-rhizobia, a subset of α-rhizobia (<20%) with the GA operon also has a full-length, functional cyp115. However, only in one strain (Mesorhizobium sp. AA22) does the GA operon have cyp115 in the same location as in gammaproteobacterial GA operons (Nett et al., 2017a). Strikingly, ROAGUE analysis indicates that full-length cyp115 has been regained independently in at least three different lineages, which is apparent in either the PS or PO reconstruction (Figures 6.3 and 6.4). Indeed, other than in Mesorhizobium sp. AA22, these full-length cyp115 copies reside in alternative locations relative to the rest of the GA operon (e.g. at the 3’ end of the operon, or distally located), as previously described (Nett et al., 2017a), which further supports the independent acquisition of cyp115 via an additional HGT event.

Similar to cyp115, a full-length idi gene is generally present in the GA operons of gammaproteobacteria, and is only sporadically present in the GA operons of α-rhizobia. Collectively, 62 of the 118 α-rhizobia strains analyzed here possess this gene, which seems to invariably exhibit analogous positioning, i.e. at the 3’ end of the operon, as found in the gammaproteobacterial GA operons. Our ROAGUE analysis indicates that the ancestral strain with the GA operon likely possessed idi, and that this gene has subsequently been lost in many α-rhizobia strains. However, there are notable differences among losses of this gene within the major α-rhizobia genera. For example, while the presence of idi appears to be almost random within Rhizobium (16/26 strains) and Sinorhizobium/Ensifer (8/14), it is nearly absent from Bradyrhizobium (2/40), but ubiquitous in Mesorhizobium (36/36).
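The per-genus contrast in idi retention is easier to see as proportions. The following sketch simply tabulates the counts quoted above; the variable names are illustrative.

```python
from typing import Dict, Tuple

# (strains with idi, strains analyzed) per genus, as quoted in the text.
idi_counts: Dict[str, Tuple[int, int]] = {
    "Rhizobium": (16, 26),
    "Sinorhizobium/Ensifer": (8, 14),
    "Bradyrhizobium": (2, 40),
    "Mesorhizobium": (36, 36),
}

idi_fraction = {genus: present / total for genus, (present, total) in idi_counts.items()}
strains_with_idi = sum(present for present, _ in idi_counts.values())

for genus, fraction in idi_fraction.items():
    print(f"{genus}: {fraction:.0%}")
print("strains with idi in these four genera:", strains_with_idi)
```

Retention ranges from 5% in Bradyrhizobium to 100% in Mesorhizobium, with Rhizobium and Sinorhizobium/Ensifer near 60%. The four genera listed sum to 62 idi-containing strains, the same figure quoted for the data set as a whole.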

Not surprisingly, ggps2 seems to be invariably associated with operons in which the canonical ggps is inactive (Figures 6.3 and 6.4), and is only found in 13 out of the 118 α-rhizobia analyzed in this study. However, the ancestral reconstructions further indicate that ggps2 is present in two distinct clades in all trees: one composed of closely related Rhizobium strains, and another with two Bradyrhizobium strains. While the Rhizobium strains all have homologous mutations in ggps, with similar positioning of ggps2 (within 500 bp of the 3’ end of the operon), the two Bradyrhizobium strains have distinct ggps mutations, with ggps2 positioned on opposite sides of the operon. There is higher homology between the GGPS2 proteins within each of the two clades than between them (Table 6.1), suggesting that each clade acquired ggps2 independently. However, the lack of synteny between the two Bradyrhizobium strains, as well as distinct lesions in ggps, suggests that each of these strains may have separately acquired ggps2 as well. Given that the GGPS2 proteins are all much more closely related to each other (>83% amino acid sequence identity) than to any other homologs (<45% identity), ggps2 appears to have undergone HGT within the α-rhizobia following the initial acquisition by a GA operon in which ggps was lost/inactivated. Nevertheless, acquisition of ggps2 was clearly followed by vertical transmission of the modified GA operon in the case of the larger and more homologous Rhizobium-containing clade.

Figure 6.5: Summary of ancestral reconstruction for the GA biosynthetic operon. As a representation of GA operon evolution, the results from the full reconstruction generated using the concatenated operon (FO) are summarized here. The initial loss of the cyp115 gene is indicated with a red arrow, while the reacquisition of this gene is indicated with a green circle at the ancestral node. The loss of ggps and acquisition of ggps2 is indicated by a blue box at the ancestral node. Brackets around a gene represent a variable presence within that lineage. Double slanted lines indicate genes that are not located within the cluster (i.e. >500 bp away). The family of proteobacterial lineages is indicated to the right of the figure (α and γ labels).
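The GGPS2 comparison rests on pairwise amino acid identity between aligned proteins. A minimal sketch of one common way to compute it is shown below. The text does not state which definition of percent identity was used, so the denominator chosen here (alignment columns without gaps) is an assumption, and the two sequences are invented toy strings, not GGPS2 sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Both inputs must come from the same pairwise alignment (equal length,
    '-' marks a gap). Other definitions use different denominators.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment (hypothetical): 5 matches over 6 ungapped columns.
print(round(percent_identity("MKT-LLA", "MKTALIA"), 1))  # 83.3
```

Applied to every pair of GGPS2 proteins and to their nearest non-operon homologs, a function like this would yield the kind of within-group and between-group values summarized in Table 6.1.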

In addition to ancestral gene loss and gain events, there also have been fusions between constituent biosynthetic genes within the GA operon. In some α-rhizobia, the fdGA gene, which is usually a distinct coding sequence, is found in-frame with either the 5’ proximal cyp114 gene or the 3’ proximal sdrGA gene, resulting in either cyp114-fd or fd-sdr fusions, which presumably act as bifunctional polypeptides. As fusion events are not factored into ROAGUE, these were assessed and categorized manually (Tables 6.3 and 6.4). The cyp114-fd fusion is only found in a single clade consisting almost entirely of Rhizobium species, which is most evident in the FO reconstruction (Figure 6.7). By contrast, while the fd-sdr fusion is largely found in a clade consisting mostly of Mesorhizobium species (Figure 6.7), this fusion appears to have independently occurred in another clade as well. This fusion seems to be functional, as the activity of the fused Fd-SDR enzyme in M. loti MAFF303099 has been biochemically verified (Tatsukami and Ueda, 2016). Beyond these multiple observations in α-rhizobia, it should be noted that a functional fd-sdr fusion appears to have independently arisen in the β-rhizobia as well (Nagel et al., 2018), indicating that this is not functionally problematic. While the functionality of the CYP114-Fd fusion has not been demonstrated to date, the cooperative activity of the two encoded enzymes (Nett et al., 2017b) supports their ability to be functionally incorporated into a single polypeptide (Marcotte et al., 1999).

6.2 Results and Discussion

Collectively, our analyses demonstrate a complex history of GA operon function, distribution, and evolution within the proteobacteria (Figure 6.5). The ROAGUE analyses reported here are consistent with the hypothesis that the GA operon has undergone HGT between various plant-associated bacteria, including phytopathogenic gammaproteobacteria and symbiotic, nitrogen-fixing α- and β-rhizobia. As an added layer of complexity, it is generally accepted that the large symbiotic or pathogenic genomic islands or plasmids (i.e. symbiotic or pathogenic modules), which enable the plant-associated lifestyle of these bacteria, are capable of undergoing HGT (Dobrindt et al., 2004). For α-rhizobia strains where sufficient genomic information is available, the GA operon is invariably found within the symbiotic module (Sullivan et al., 2002; Freiberg et al., 1997; González et al., 2003; Göttfert et al., 2001; Kaneko et al., 2002; Uchiumi et al., 2004). Interestingly, HGT of the GA operon independently of the symbiotic module has been previously suggested based on phylogenetic incongruences between genes representative of species (16S rRNA), symbiotic module (nifK), and GA operon (cps) similarity (Hershey et al., 2014). Thus, there appear to be multiple levels of HGT with the GA operon in α-rhizobia: 1) acquisition of the symbiotic module (i.e. symbiotic plasmid or genomic island), either with or without the GA operon, 2) separate acquisition of the GA operon within the symbiotic module, and 3) subsequent acquisition of auxiliary genes, including ggps2 and cyp115.

Although widespread within proteobacteria, the GA operon has thus far only been found in plant-associated species (Levy et al., 2018). While this is not surprising due to the function of GA as a phytohormone, it emphasizes that such manipulation of host plants is an effective mechanism for bacteria to gain selective advantage. Indeed, the ability to produce GA seems to be a powerful method of host manipulation for plant-associated microbes more generally, as certain phytopathogenic fungi also have convergently evolved the ability to produce GA as a virulence factor (Wiemann et al., 2013; Hedden and Sponsel, 2015).

Despite the wide-ranging HGT of the GA operon between disparate classes of proteobacteria, its scattered distribution within each of these classes strongly indicates that the ability to produce GA only provides a selective advantage under certain conditions. This is evident for both symbiotic rhizobia and bacterial phytopathogens. For example, Xoc, the causal agent of bacterial leaf streak in rice, produces GA as a virulence factor to suppress the plant jasmonic acid (JA) induced defense response (Lu et al., 2015; Navarro et al., 2008; Hou et al., 2010). By contrast, the production of GA by the α-rhizobium M. loti MAFF303099 limits the formation of additional nodules, apparently without having a negative impact on plant growth 4.

The occurrence of GA operon fragments (i.e. presence of some, but not all necessary biosynthetic operon genes) in many rhizobia would seem to indicate that the production of GA is not advantageous in all rhizobia-legume symbioses. For example, at the onset of this study, we identified >160 α-rhizobia with an apparent homolog of at least one GA operon gene, yet only 120 of these contained a gene cluster (i.e. two or more biosynthetic genes clustered together), and many of these clusters are clearly non-functional due to the absence of crucial biosynthetic genes. It has been suggested that the GA operon is associated with species that inhabit determinate nodules (Hershey et al., 2014), as these nodules grow via cell expansion (an activity commonly associated with GA signaling; Schwechheimer, 2012), and not indeterminate nodules, which grow via continuous cell division (Oldroyd et al., 2011). However, while the presence of the GA operon does seem to be somewhat enriched within rhizobia that associate with determinate nodule-forming legumes (Table 6.2), there are many examples of rhizobia with complete GA operons that were isolated from indeterminate nodules. For example, while most GA operon-containing Bradyrhizobium species associate with determinate nodule-forming plants, many species from the Ensifer/Sinorhizobium, Mesorhizobium, and Rhizobium genera with the operon were isolated from indeterminate nodules, as were all three of the β-rhizobia with the GA operon (Integrated Microbe Genomes, JGI). Additionally, many rhizobia, including some with the GA operon such as S. fredii NGR234 (Pueppke and Broughton, 1999), are capable of symbiosis with either type of legume, i.e. those forming either determinate or indeterminate nodules. These inconsistencies raise the question of why only some rhizobia have acquired and maintained the GA operon, and thus the capacity to produce GA.

In addition to its scattered distribution, the operon exhibits notable genetic diversity within the α-rhizobia. For example, ROAGUE analysis indicates that loss of the usual ggps and subsequent recruitment of ggps2 has been followed by HGT of this gene to other operons in which ggps has been inactivated. While it is difficult to infer the original source of ggps2, it is of note that the closest non-GA operon homologs are found clustered together with genes related to photosynthesis (e.g. in the alphaproteobacterium Hyphomicrobium sp. ghe19). Though this specific gene has not been functionally characterized, it seems reasonable that these homologs may be involved in producing GGPP as a precursor of photosynthetic pigments, i.e. phytol and/or carotenoids (Zi et al., 2014).

While the loss/gain of GGPP-synthase genes represents rather dramatic events in GA operon evolution, other modifications to the operon may have subtle yet informative effects upon GA production. In particular, although the loss of idi (and potentially the observed gene fusion events) may reduce the rate of GA production, it appears that this can be easily accommodated in certain rhizobia-legume pairings. Indeed, the expression of the GA operon is delayed in this symbiotic relationship (Méndez et al., 2014), perhaps to mitigate any deleterious effects of GA during early nodule formation, as GA has been shown to inhibit nodule formation, at least at higher concentrations (Hayashi et al., 2014).

Perhaps the most striking evolutionary aspect of the rhizobial GA operons is the early loss and scattered re-acquisition of cyp115 in α-rhizobia. While almost all operon-containing α-rhizobia contain remnants of cyp115 at the position in the GA operon where it is found in gammaproteobacterial phytopathogens (Nett et al., 2017a), there is only one strain (Mesorhizobium sp. AA22) where a full-length copy is found in this location. Phylogenetic analysis further suggests that cyp115 from Mesorhizobium sp. AA22 is closest to the ancestor of all the full-length copies found in α-rhizobia, which are all found at varied locations relative to the GA operon (Nett et al., 2017a). The ROAGUE analysis reported here indicates that cyp115 was lost shortly after the acquisition of the ancestral GA operon by α-rhizobia, despite full-length copies being present in several different lineages. Accordingly, these results support the hypothesis that cyp115 has been re-acquired by this subset of rhizobia via independent HGT events. Notably, while not recognized in the original report (Tatsukami and Ueda, 2016), this includes M. loti MAFF303099, the only strain in which the biological role of rhizobial production of GA has been examined. Because cyp115 is required for endogenous production of bioactive GA4 from the penultimate (inactive) precursor GA9, this provides additional intrigue regarding the selective pressures driving the evolution of GA biosynthesis in rhizobia.

The contrast between GA operon-containing bacterial lineages provides a captivating rationale for the further scattered distribution of cyp115 in rhizobia. In particular, the phytopathogens all contain cyp115 and are thus capable of direct production of bioactive GA4, which serves to suppress the JA-induced plant defense response (Lu et al., 2015). Extension of this observation leads to the hypothesis that rhizobial production of GA4 might negatively impact the ability of the host plant to defend against microbial pathogens invading the roots or root nodules, which would compromise the efficacy of this symbiotic interaction. As such, the detrimental effect of rhizobial production of bioactive GA4 may have driven loss of cyp115. However, this would also result in a loss of GA signaling, as GA9, the product of an operon missing cyp115, presumably does not exert hormonal activity (Ueguchi-Tanaka et al., 2005). One possible mechanism to compensate for cyp115 loss would be legume host expression of the functionally equivalent plant GA 3-oxidase (GA3ox) gene (from endogenous plant GA metabolism) within the nodules in which the rhizobia reside. The expression of this plant gene could alleviate the necessity for rhizobial symbiont maintenance of cyp115, and would further allow the host to control the production of bioactive GA4, and thereby mount an effective defense response when necessary. Re-acquisition of cyp115 could then be driven by a lack of such GA3ox expression in nodules by certain legumes. However, this scenario remains hypothetical: although precise GA production by the plant has been shown to be critical for normal nodulation to occur (Hayashi et al., 2014; McAdam et al., 2018), coordinated biosynthesis of GA4 by rhizobia and the legume host would need to be demonstrated. This includes both the transport of GA9 from microbe to host plant and the subsequent conversion of this precursor to a bioactive GA (e.g. GA4). Accordingly, continued study of the GA operon will provide insight into the various roles played by bacterially-produced GA in both symbiotic rhizobia-legume relationships and antagonistic plant-pathogen interactions, which in turn can be expected to provide fundamental knowledge regarding the ever-expanding roles of GA signaling in plants.

General methodology and source code

A summary of the methods used herein can be found in Figure 6.2. All code and scripts used for analysis within this manuscript, as well as a general workflow for the use of the ROAGUE method in the ancestral reconstruction of gene blocks, can be found at the following GitHub repositories:

https://github.com/nguyenngochuy91/Gibberellin-Operon

https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction

Identifying orthologous gene blocks

Using Xanthomonas oryzae pv. oryzicola BLS256 (Xoc) as a reference taxon, we retrieved the 10 genes in the GA operon (cyp115, cyp112, cyp114, fdGA, sdrGA, cyp117, ggps, cps, ks, and idi), for which all genes or their orthologs (except idi) have been previously experimentally validated (Lu et al., 2015; Hershey et al., 2014; Nagel et al., 2017). From those 10 genes, we determined whether a query strain contains orthologous gene blocks via BLAST searches against the NCBI non-redundant nucleotide database. Sequences were confirmed as orthologs if the BLAST e-value was 1e-10 or less. Initial BLAST analysis (April 12, 2017) revealed 166 bacterial strains within the alphaproteobacteria class (specifically: Azorhizobium, Bradyrhizobium, Mesorhizobium, Microvirga, Rhizobium, and Ensifer/Sinorhizobium species) that contained orthologs of one or more of the GA operon genes. Given this set of 166 species/strain names, the corresponding genome assembly files were retrieved from the NCBI website. Using their assembly_summary.txt file, the strains' genomic fna (FASTA nucleic acid) files were downloaded. The number of strains analyzed was further reduced by only including strains with multiple GA operon genes (>2) clustered together, resulting in a final total of 118 strains (listed in Supplemental Table 3). Retrieved genome assemblies for these strains were then annotated using Prokka (Seemann, 2014).
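The filtering described above can be illustrated with a minimal Python sketch. It is not the code from the repositories listed earlier: the function name, the tuple layout of a hit, and the example strain names are hypothetical, and the sketch only counts distinct genes per strain without checking that they are physically clustered on the chromosome.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-10  # orthology threshold stated in the text
MIN_GENES = 3          # "multiple GA operon genes (>2)"

def strains_with_operon(hits, evalue_cutoff=EVALUE_CUTOFF, min_genes=MIN_GENES):
    """Return strains with enough distinct GA operon genes passing the cutoff.

    hits: iterable of (strain, gene, evalue) tuples, one per BLAST hit
          (a hypothetical, simplified view of tabular BLAST output).
    """
    genes_by_strain = defaultdict(set)
    for strain, gene, evalue in hits:
        if evalue <= evalue_cutoff:
            genes_by_strain[strain].add(gene)
    return sorted(s for s, genes in genes_by_strain.items()
                  if len(genes) >= min_genes)

# Toy input: only strain_A has more than two genes below the cutoff.
hits = [
    ("strain_A", "cyp112", 1e-50),
    ("strain_A", "cyp114", 1e-40),
    ("strain_A", "cps", 1e-30),
    ("strain_B", "cyp112", 1e-50),
    ("strain_B", "ks", 1e-5),   # fails the e-value cutoff
    ("strain_C", "idi", 1e-20),
]
kept = strains_with_operon(hits)
```

A clustering check, for example on genomic coordinates of the hits, would be needed to reproduce the "clustered together" criterion.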

Identification of pseudo-cyp115 sequences

Though cyp115 is found as a full-length gene in the majority of gammaproteobacterial GA operons, as well as in some alphaproteobacterial operons, it exists as a truncated open reading frame, or gene fragment, in most alphaproteobacteria. Previous assessment of cyp115 gene fragments was performed through manual assessment of the genomic sequence 5' to cyp112 in the GA operon (Nett et al., 2017a). To identify these gene fragments (pseudo-cyp115, or p-cyp115) in a more streamlined process, BLAST searches were performed with the Xoc cyp115, as described above, to identify sequences with a BLAST e-value less than 1e-10. If the length of the identified sequence was less than 60% that of the query gene and it shared greater than 50% sequence identity, then the sequence was annotated as p-cyp115.
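As an illustration, the annotation rule above can be written as a small Python function. This is a sketch under stated assumptions, not the dissertation's code: the function name and return labels are hypothetical, and the handling of hits that fall outside the stated rule (labelled "cyp115" or "unclassified" here) is an assumption, since the text only defines when a hit is called p-cyp115.

```python
EVALUE_CUTOFF = 1e-10        # hits must have an e-value below this
MAX_LENGTH_FRACTION = 0.60   # fragment: shorter than 60% of the query gene
MIN_IDENTITY = 50.0          # fragment: more than 50% sequence identity

def classify_cyp115_hit(hit_length, query_length, percent_identity, evalue):
    """Classify one BLAST hit against the reference cyp115 gene."""
    if evalue >= EVALUE_CUTOFF:
        return "no_hit"
    if hit_length / query_length < MAX_LENGTH_FRACTION:
        if percent_identity > MIN_IDENTITY:
            return "p-cyp115"      # rule stated in the text
        return "unclassified"      # assumption: short and too divergent
    return "cyp115"                # assumption: long enough to be full-length

# Hypothetical query length of 1300 bp, for illustration only.
fragment = classify_cyp115_hit(500, 1300, 82.0, 1e-40)
full = classify_cyp115_hit(1290, 1300, 78.0, 1e-150)
weak = classify_cyp115_hit(500, 1300, 82.0, 1e-3)
```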

Computational Reconstruction of the Gibberellin Operon Phylogeny

ROAGUE software was used to reconstruct ancestral gene blocks. Previous phylogenetic analysis demonstrated incongruences between phylogenetic trees constructed with a species marker (16S rRNA), a symbiotic module marker (nifK), and GA operon genes (Hershey et al., 2014). This suggests independent horizontal gene transfer (HGT) of both the symbiotic module and the GA operon. In an effort to objectively and thoroughly assess the possibility of HGT of the GA operon among alphaproteobacterial rhizobia, phylogenetic trees were constructed using alignments from both a species marker gene (rpoB) and a concatenation of the protein sequences for genes in the GA operon. The species tree (FS, PS) was generated by aligning rpoB (species marker) protein sequences via the MUSCLE algorithm (https://www.ebi.ac.uk/Tools/msa/muscle/), and then using the Neighbor-Joining method to generate the tree. For the operon tree (FO, PO), the protein sequences of open reading frames (ORFs) of the orthoblock genes for each species were naively concatenated. Because many species lack full-length cyp115 and idi genes, and due to the loss of ggps in several species, only the following genes, which seem to be more uniformly conserved, were concatenated for this purpose: cyp112, cyp114, fdGA, sdrGA, cyp117, cps, and ks. Multiple sequence alignment of these concatenations was performed using the MUSCLE algorithm, and the Neighbor-Joining method was used to build the presented trees. Erwinia tracheiphila PSU-1 was used as the phylogenetic outgroup in both trees, as this species is phylogenetically distant and has been shown to have the GA operon most distant from that of the alphaproteobacteria (Nagel and Peters, 2017).

ROAGUE analysis (Nguyen et al., 2019) was then applied to each phylogenetic reconstruction. Each leaf node v in FS and PS contains orthologs to the genes found in the GA operon of the reference species (Xoc). For any two genes a and b, if the chromosomal distance is less than 500 bp, the genes are written as ab. If the distance is greater than 500 bp, they are written with the separator character, that is, as a|b.
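The gene block notation can be sketched in a few lines of Python. This is an illustrative sketch rather than ROAGUE's own implementation: the function name, the (letter, start, end) tuple layout, and the coordinates are hypothetical, the separator is written as "|", and a gap of exactly 500 bp (which the text leaves unspecified) is treated as adjacent.

```python
def encode_gene_block(genes, max_gap=500, separator="|"):
    """Encode genes on one contig as a gene block string.

    genes: list of (letter, start, end) tuples with start <= end.
    A separator is inserted when the intergenic distance exceeds max_gap.
    """
    parts = []
    prev_end = None
    for letter, start, end in sorted(genes, key=lambda g: g[1]):
        if prev_end is not None and start - prev_end > max_gap:
            parts.append(separator)
        parts.append(letter)
        # Track the furthest end seen so overlapping genes are handled.
        prev_end = end if prev_end is None else max(prev_end, end)
    return "".join(parts)

# b and c are 50 bp apart; d lies 900 bp downstream of c.
block = encode_gene_block([("b", 100, 1300), ("c", 1350, 2600), ("d", 3500, 3800)])
```

With these toy coordinates the block is written with b and c joined and d split off by the separator.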

Due to the large number (118 strains) and redundancy (both phylogenetically and in operon structure) of the alphaproteobacterial strains, the size of the tree was reduced using the Phylogenetic Diversity Analyzer software (Chernomor et al., 2015). To facilitate analysis and presentation, the number of species was reduced to 64, which was still representative of the overall diversity and enabled ready visualization. Additionally, this software only keeps species that have a sufficiently unique sequence identity to give distinct branches on the phylogenetic tree (i.e., reflecting appreciable distance between species). Accordingly, this approach also eliminated redundancy that would otherwise have confounded this analysis. The full operon tree, full species tree, partial operon tree, and partial species tree are referred to as the FO, FS, PO, and PS trees, respectively. The topology of each of these reconstructions was then compared in order to identify major incongruences that may indicate HGT.

Identification of gene fusion events

Currently, ROAGUE does not account for gene fusion. Given the presence of several GA operons wherein the fdGA gene is fused in frame with either the sdrGA gene or the cyp114 gene, it was necessary to assess these manually. This was accomplished by rechecking the gene block to determine if the fdGA gene was missing, as the fusion of this gene in frame with either sdrGA or cyp114 would result in it not being recognized as a unique ORF by the ROAGUE software. Then, the BLAST results were queried, and potential fusions were found by checking the start and end of the alignments for the subject genome. Strains with cyp114-fd fusions are listed in Table 6.3, and strains with fd-sdr fusions are listed in Table 6.4.
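One way the coordinate check could be automated is sketched below in Python. The text describes a manual assessment, so this is only a hypothetical illustration: the function name, the data layout, and the containment criterion (the fdGA alignment lying entirely inside an annotated neighbouring ORF) are assumptions, not the procedure actually used.

```python
def find_fusion_partner(fd_alignment, orfs):
    """Return the ORF that fully contains the fdGA alignment, or None.

    fd_alignment: (start, end) of the fdGA BLAST alignment on the subject genome.
    orfs: dict mapping gene name to its annotated (start, end) coordinates.
    """
    a_start, a_end = sorted(fd_alignment)
    for gene, coords in orfs.items():
        o_start, o_end = sorted(coords)
        if o_start <= a_start and a_end <= o_end:
            return gene
    return None

# Toy coordinates: the fdGA alignment sits inside the annotated cyp114 ORF.
orfs = {"cyp114": (1000, 2700), "sdrGA": (2750, 3500)}
partner = find_fusion_partner((2400, 2690), orfs)
no_partner = find_fusion_partner((3600, 3900), orfs)
```

Candidate fusions flagged this way would still need manual confirmation that the two coding regions are in frame.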

[Figure 6.6 image not reproduced: phylogenetic tree with reconstructed ancestral gene blocks at internal nodes. Legend: deletion count 85; duplication count 0; split count 36. Gene key: a = cyp115, b = cyp112, c = cyp114, d = fd, e = sdr, f = cyp117, g = ggps, h = cps, i = ks, j = idi, k = ggps2, p = cyp115 pseudogene/fragment.]

Figure 6.6: Ancestral reconstruction of the GA biosynthetic operon using rpoB for phylogenetic analysis. A phylogenetic tree was constructed with alignments of rpoB protein sequences from 118 α-rhizobia species and two γ-proteobacteria using the Neighbor-Joining method as a measure of the distance between species. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

[Figure 6.7 image not reproduced: phylogenetic tree with reconstructed ancestral gene blocks at internal nodes. Legend: deletion count 55; duplication count 0; split count 20. Gene key: a = cyp115, b = cyp112, c = cyp114, d = fd, e = sdr, f = cyp117, g = ggps, h = cps, i = ks, j = idi, k = ggps2, p = cyp115 pseudogene/fragment.]

Figure 6.7: Ancestral reconstruction of the GA biosynthetic operon using the concatenated operon for phylogenetic analysis. A phylogenetic tree was constructed with alignments of concatenated proteins from core GA operon genes (cyp112-cyp114-fd-sdr-cyp117-cps-ks) from 118 α-rhizobia species and two γ-proteobacteria using the Neighbor-Joining method. ROAGUE was then applied to create the ancestral operon reconstruction. Annotations are defined in section 6.1.

6.3 Appendix: Supplementary Materials

Table 6.1: Pairwise alignments of GGPS2 proteins and representative GGPS proteins. Protein sequences were aligned using the MUSCLE algorithm (Geneious Prime; default settings). The GGPS proteins included in this analysis were selected because the core operons of their strains are phylogenetically close to those of the ggps2-containing strains, which lack a full-length ggps gene (i.e., Rhizobium etli bv. mimosae str. Mim1 has the most similar core operon to the ggps2-containing Rhizobium strains, and Bradyrhizobium sp. WSM1417 has the most similar core operon to the ggps2-containing Bradyrhizobium strains). Sequences with greater than 90% amino acid identity are highlighted in dark blue, while those with greater than 80% but less than 90% amino acid identity are highlighted in light blue.

Strain | (1) | (2) | (3) | (4) | (5) | (6) | (7)
(1) Rhizobium sp. CCGE 510 | - | - | - | - | - | - | -
(2) Rhizobium sp. HBR26 | 95 | - | - | - | - | - | -
(3) Rhizobium etli CFN 42 | 96 | 97 | - | - | - | - | -
(4) Bradyrhizobium sp. WSM2254 | 83 | 84 | 84 | - | - | - | -
(5) Bradyrhizobium sp. WSM3983 | 83 | 84 | 84 | 94 | - | - | -
(6) Rhizobium etli bv. mimosae str. Mim1 | 28 | 29 | 29 | 30 | 30 | - | -
(7) Bradyrhizobium sp. WSM1417 | 28 | 29 | 29 | 29 | 29 | 88 | -

Table 6.2: List of strains used within the ancestral reconstruction analyses. Shown for each strain is the corresponding GenBank genome accession, the legume host from which it was isolated, and the nodule type (D = determinate, I = indeterminate, * = neither). Note that the legume host and nodule type are irrelevant for the gammaproteobacteria plant pathogens analyzed in these reconstructions.

Genome accession | Strain | Legume host | Type
AXBA01000001.1 | Azorhizobium doebereinerae UFLA1-100 | Sesbania virgata | *
AP014659.1 | Bradyrhizobium diazoefficiens NK6 | Glycine max | D
ADOU02000001.1 | Bradyrhizobium diazoefficiens SEMIA 5080 | Glycine max | D
BA000040.2 | Bradyrhizobium diazoefficiens USDA 110 | Glycine max | D
AXAH01000003.1 | Bradyrhizobium elkanii USDA 3254 | Phaseolus acutifolius | D
AXAW01000001.1 | Bradyrhizobium elkanii USDA 3259 | Phaseolus lunatus | D
KB900701.1 | Bradyrhizobium elkanii USDA 76 | Glycine max | D
AXAU01000001.1 | Bradyrhizobium elkanii WSM1741 | Rhynchosia minima | D
LFIP02000001.1 | Bradyrhizobium embrapense SEMIA 6208 | Desmodium heterocarpon | D
AXBC01000011.1 | Bradyrhizobium genosp. SA-4 str. CB756 | Macrotyloma africanum | D
CP010313.1 | Bradyrhizobium japonicum E109 | Glycine max | D
JGCL01000001.1 | Bradyrhizobium japonicum FN1 | Glycine max (soil isolate) | D
LGUJ01000001.1 | Bradyrhizobium japonicum Is-1 | Glycine max | D
JRPN01000001.1 | Bradyrhizobium japonicum Is-34 | Glycine max | D
CP007569.1 | Bradyrhizobium japonicum SEMIA 5079 | Glycine max | D
AXAX01000001.1 | Bradyrhizobium japonicum USDA 122 | Glycine max | D
AXVP01000001.1 | Bradyrhizobium japonicum USDA 123 | Glycine max | D
KB893805.1 | Bradyrhizobium japonicum USDA 124 | Glycine max | D
AXAT01000001.1 | Bradyrhizobium japonicum USDA 135 | Glycine max | D
AXAG01000003.1 | Bradyrhizobium japonicum USDA 38 | Glycine max | D
AXAF01000003.1 | Bradyrhizobium japonicum USDA 4 | Glycine max | D
AP012206.1 | Bradyrhizobium japonicum USDA 6 | Glycine max | D
LJYE01000001.1 | Bradyrhizobium pachyrhizi BR3262 | Vigna unguiculata | D
LFIQ01000001.1 | Bradyrhizobium pachyrhizi PAC 48 | Pachyrhizus erosus | D
AJQG01000001.1 | Bradyrhizobium sp. CCBAU 15615 | Glycine max | D
AJQH01000001.1 | Bradyrhizobium sp. CCBAU 15635 | Glycine max | D
AJQE01000001.1 | Bradyrhizobium sp. CCBAU 43298 | Glycine max | D
AUFA01000001.1 | Bradyrhizobium sp. Cp5.3 | Centrosema pubescens | D
FMAI01000001.1 | Bradyrhizobium sp. ERR11 | Erythrina brucei | D
AUGA01000001.1 | Bradyrhizobium sp. th.b2 | Amphicarpaea bracteata | D
AXAD01000001.1 | Bradyrhizobium sp. USDA 3384 | Kennedia coccinea | D
JH600072.1 | Bradyrhizobium sp. WSM1253 | Ornithopus compressus | D
KI911783.1 | Bradyrhizobium sp. WSM1417 | Lupinus sp. | I
AXAZ01000001.1 | Bradyrhizobium sp. WSM1743 | Indigofera sp. | I
AXAB01000001.1 | Bradyrhizobium sp. WSM2254 | Acacia dealbata | I
KB902768.1 | Bradyrhizobium sp. WSM2793 | Rhynchosia totta | D
AXAY01000001.1 | Bradyrhizobium sp. WSM3983 | Kennedia coccinea | D
KB890498.1 | Bradyrhizobium sp. WSM4349 | Syrmatium glabrum | D*
CM001442.1 | Bradyrhizobium sp. WSM471 | Ornithopus pinnatus | D
LJYF01000001.1 | Bradyrhizobium yuanmingense BR3267 | Kennedia coccinea | D
AJQL01000001.1 | Bradyrhizobium yuanmingense CCBAU 35157 | Glycine max | D
AJQT01000001.1 | Ensifer sojae CCBAU 5684 | Glycine max | D
AZNX01000001.1 | Ensifer sp. TW10 | Tephrosia purpurea | I
AZUW01000001.1 | Ensifer sp. WSM1721 | Indigofera sp. | I
AHAM01000001.1 | Mesorhizobium alhagi CCNWXJ12-2 | Alhagi sparsifolia | I
AGSN01000001.1 | Mesorhizobium amorphae CCNWGS0123 | Robinia pseudoacacia | I
CP003358.1 | Mesorhizobium australicum WSM2073 | Biserrula pelecinus | I
CP002447.1 | Mesorhizobium ciceri bv. biserrulae WSM1271 | Biserrula pelecinus | I
CP015064.1 | Mesorhizobium ciceri bv. biserrulae WSM1284 | Biserrula pelecinus | I
AXAE01000005.1 | Mesorhizobium erdmanii USDA 3471 | Lotus corniculatus | D
AXAL01000001.1 | Mesorhizobium loti CJ3sym | Lotus corniculatus | D
AP003017.1 | Mesorhizobium loti MAFF303099 | Lotus pedunculatus | D
LYTJ01000001.1 | Mesorhizobium loti NZP2014 | Lotus sp. | D
KB913026.1 | Mesorhizobium loti NZP2037 | Lotus divaricatus | D
LYTK01000001.1 | Mesorhizobium loti NZP2042 | Lotus sp. | D
KI632510.1 | Mesorhizobium loti R7A | Lotus corniculatus | D
KI912159.1 | Mesorhizobium loti R88b | Lotus corniculatus | D
CP002279.1 | Mesorhizobium opportunistum WSM2075 | Biserrula pelecinus | I
LYTO01000001.1 | Mesorhizobium sp. AA22 | Biserrula pelecinus L. | I
AYXF01000001.1 | Mesorhizobium sp. L103C105A0 | Acmispon wrangelianus | D
AYWZ01000001.1 | Mesorhizobium sp. L2C066B000 | Acmispon wrangelianus | D
AYWW01000001.1 | Mesorhizobium sp. L2C085B000 | Acmispon wrangelianus | D
AYWU01000001.1 | Mesorhizobium sp. L48C026A00 | Acmispon wrangelianus | D
AYWP01000001.1 | Mesorhizobium sp. LNHC232B00 | Acmispon wrangelianus | D
AYWN01000001.1 | Mesorhizobium sp. LNJC372A00 | Acmispon wrangelianus | D
AYWB01000001.1 | Mesorhizobium sp. LSHC412B00 | Acmispon wrangelianus | D
AYWA01000001.1 | Mesorhizobium sp. LSHC414A00 | Acmispon wrangelianus | D
AYVY01000001.1 | Mesorhizobium sp. LSHC420B00 | Acmispon wrangelianus | D
AYVX01000001.1 | Mesorhizobium sp. LSHC422A00 | Acmispon wrangelianus | D
AYVS01000001.1 | Mesorhizobium sp. LSHC440B00 | Acmispon wrangelianus | D
AYVP01000001.1 | Mesorhizobium sp. LSJC265A00 | Acmispon wrangelianus | D
AYVO01000001.1 | Mesorhizobium sp. LSJC268A00 | Acmispon wrangelianus | D
AYVN01000001.1 | Mesorhizobium sp. LSJC269B00 | Acmispon wrangelianus | D
AYVM01000001.1 | Mesorhizobium sp. LSJC277A00 | Acmispon wrangelianus | D
MDLH01000001.1 | Mesorhizobium sp. SEMIA 3007 | Pisum sativum | I
CAAF010000001.1 | Mesorhizobium sp. STM 4661 | Anthyllis vulneraria | D
AZUV01000001.1 | Mesorhizobium sp. WSM1293 | Lotus sp. | D
AZUX01000001.1 | Mesorhizobium sp. WSM2561 | Lessertia diffusa | I
ATYO01000001.1 | Mesorhizobium sp. WSM3224 | Otholobium candicans | D
AZUY01000001.1 | Mesorhizobium sp. WSM3626 | Lessertia diffusa | I
AZYE01000987.1 | Microvirga lupini Lut6 | Lupinus texensis | I
LJSR01000001.1 | Rhizobium acidisoli FH23 | Phaseolus vulgaris | D
LFIO01000001.1 | Rhizobium ecuadorense CNPSO 671 | Phaseolus vulgaris | D
CP006986.1 | Rhizobium etli bv. mimosae str. IE4771 | Phaseolus vulgaris | D
CP005950.1 | Rhizobium etli bv. mimosae str. Mim1 | Mimosa affinis | I
CP007641.1 | Rhizobium etli bv. phaseoli str. IE4803 | Phaseolus vulgaris | D
CP000133.1 | Rhizobium etli CFN 42 | Phaseolus vulgaris | D
CP001074.1 | Rhizobium etli CIAT 652 | Phaseolus vulgaris | D
ATTO01000001.1 | Rhizobium favelukesii OR191 | Medicago sativa | I
AQHN01000001.1 | Rhizobium freirei PRF 81 | Phaseolus vulgaris | D
CP006877.1 | Rhizobium gallicum bv. gallicum R602 | Phaseolus vulgaris | D
AEYE02000001.1 | Rhizobium grahamii CCGE 502 | Dalea leporina | I
KB905373.1 | Rhizobium leguminosarum bv. phaseoli 4292 | Phaseolus vulgaris | D
JFGP01000001.1 | Rhizobium leguminosarum bv. phaseoli CCGM1 | Phaseolus vulgaris | D
ATTN01000001.1 | Rhizobium leguminosarum bv. phaseoli FA23 | Phaseolus vulgaris | D
AJUI01000023.1 | Rhizobium leguminosarum bv. trifolii CC278f | Trifolium nanum | I
KI911771.1 | Rhizobium leguminosarum bv. trifolii CC283b | Trifolium ambiguum | I
ATYQ01000001.1 | Rhizobium leguminosarum bv. viciae VF39 | Vicia faba | I
AUFB01000001.1 | Rhizobium leucaenae USDA 9039 | Phaseolus vulgaris | D
FMAF01000001.1 | Rhizobium lusitanum P1-7 | Phaseolus vulgaris | D
HF536772.1 | Rhizobium mesoamericanum STM3625 | Mimosa pudica | I
ATYY01000001.1 | Rhizobium mesoamericanum STM6155 | Mimosa pudica | I
ATTQ01000001.1 | Rhizobium mongolense USDA 1844 | Medicago ruthenica | I
AHJU02000001.1 | Rhizobium phaseoli Ch24-10 | Phaseolus vulgaris | D
AEYF01000001.1 | Rhizobium sp. CCGE 510 | Phaseolus albescens | D
FMAJ01000001.1 | Rhizobium sp. HBR26 | Phaseolus vulgaris | D
CP004015.1 | Rhizobium tropici CIAT 899 | Phaseolus vulgaris | D
JFGO01000001.1 | Sinorhizobium americanum CCGM7 | Phaseolus vulgaris | D
ATYB01000007.1 | Sinorhizobium arboris LMG 14919 | Prosopis chilensis | I
AJQN01000001.1 | Sinorhizobium fredii CCBAU 25509 | Glycine max | D
AMCX01000001.1 | Sinorhizobium fredii GR64 | Phaseolus vulgaris | D
HE616890.1 | Sinorhizobium fredii HH103 | Glycine max | D
CP000874.1 | Sinorhizobium fredii NGR234 | Lablab purpureus | D
CP003563.1 | Sinorhizobium fredii USDA 257 | Glycine max | D
AQWP01000001.1 | Sinorhizobium meliloti 4H41 | Phaseolus vulgaris | D
ATZC01000001.1 | Sinorhizobium meliloti GVPV12 | Phaseolus vulgaris | D
ATYC01000007.1 | Sinorhizobium meliloti WSM4191 | Melilotus siculus | I
LATE01000001.1 | Sinorhizobium sp. PC2 | Prosopis cineraria | I
CP003057.2 | Xanthomonas oryzae pv. oryzicola BLS256 | - | -
KE136308.1 | Erwinia tracheiphila PSU-1 | - | -

Table 6.3: Strains containing a CYP114-FdGA gene fusion.

Bradyrhizobium sp. ERR11

Bradyrhizobium yuanmingense BR3267

Rhizobium etli CIAT 652

Rhizobium etli CNPAF512

Rhizobium etli bv. phaseoli IE4803

Rhizobium phaseoli Ch24-10

Rhizobium etli CFN 42

Rhizobium leguminosarum bv. phaseoli CCGM1

Rhizobium acidisoli FH23

Rhizobium sp. HBR26

Mesorhizobium sp. WSM3224

Rhizobium etli bv. phaseoli str. IE4803

Rhizobium leguminosarum bv. trifolii CC278f

Rhizobium leguminosarum bv. phaseoli FA23

Rhizobium etli bv. mimosae str. Mim1

Sinorhizobium meliloti WSM4191

Table 6.4: Strains containing a FdGA-SDRGA gene fusion.

Mesorhizobium sp. LSHC414A00

Mesorhizobium loti MAFF303099

Mesorhizobium loti CJ3sym

Mesorhizobium loti R7A

Rhizobium mongolense USDA 1844

Mesorhizobium sp. LSJC269B00

Mesorhizobium sp. LSJC277A00

Mesorhizobium sp. LSJC265A00

Mesorhizobium sp. L2C066B000

Mesorhizobium sp. L2C085B000

Bradyrhizobium sp. WSM1253

Mesorhizobium loti NZP2042

Mesorhizobium sp. LSHC440B00

Mesorhizobium sp. SEMIA 3007

Mesorhizobium sp. L48C026A00

Mesorhizobium sp. LNHC232B00

Bradyrhizobium sp. WSM471

Bradyrhizobium japonicum USDA 135

Mesorhizobium sp. LSJC268A00

Mesorhizobium sp. STM 4661

Mesorhizobium loti NZP2037

Mesorhizobium loti R88b

Mesorhizobium loti NZP2014

Mesorhizobium sp. LSHC420B00

Mesorhizobium sp. LNJC372A00

Mesorhizobium sp. LSHC412B00

Mesorhizobium erdmanii USDA 3471

Mesorhizobium sp. L103C105A0

Mesorhizobium sp. LSHC422A00

CHAPTER 7. DISCUSSION AND FUTURE DIRECTIONS

Evolution is an important force that drives the complexity of biological systems, and gene blocks in bacteria are products of the evolution of such complex systems. Many models have been proposed to explain the evolution of gene blocks in bacteria; however, the event-based model provides a more practical tool for studying evolutionary change in bacterial gene blocks. In this thesis, I presented three studies on the evolution of gene blocks using the event-based model. I hope to draw more attention to this niche of evolutionary study and to promote the use of the event-based model to quantify evolutionary change in gene blocks. My current methods can be extended in a few directions so that the model can tackle more difficult and important problems.

In this chapter, I discuss Chapters 4, 5, and 6. The discussion of each project is organized into three components:

• Problem: the biological problem of interest.

• Approach: the computational methods to tackle the problem.

• Results: the impact of this work on scientific research.

An outline of possible extensions of my methods, along with some interesting ideas to examine, appears in each section on future directions.

7.1 Finding Orthologous Gene Blocks in Bacteria: the Computational Hardness of the Problem and Novel Methods to Address It

7.1.1 Discussions

• Problem. In Chapter 4, we presented the problem of finding orthologous gene blocks in bacteria using the event-based method, an established and useful method for measuring the evolutionary changes of gene blocks in bacteria (Ream et al., 2015). Although the problem had been studied previously, there had been no formal study of its complexity. The previous work introduced a heuristic approach that came with no guarantee on either its running time or the optimality of its results.

• Approach. To examine the complexity of the problems, we formalized them in Chapter 3. Under these formalizations, we proved that the problem is intrinsically hard (NP-Hard), which means that there is likely no polynomial-time algorithm that solves it exactly. Furthermore, we showed that the problem is hard to approximate (APX-Hard) within ln n of the optimal result. We then presented my greedy approach and proved that it is a ln n approximation algorithm. To compare my approach with the previous approach, we formulated our problem as an Integer Linear Program and solved it exactly using CPLEX. The results showed that my method was faster and more accurate than the previous approach.

• Results. The article describes the first complexity study of finding orthologous gene blocks in bacteria. It provided insight into how hard the problem is, which steered us away from trying to solve the problem exactly and toward other alternatives. By exploring alternative approaches such as approximation, we came up with a polynomial-time algorithm that provided a lower bound on the results. This work showed that even a simplified version of a biological problem can be extremely hard to solve. However, one should not be discouraged, since there are approaches that solve it exactly on small data sets and approximate it on large data sets.
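The greedy idea behind a logarithmic approximation guarantee can be illustrated on classic set cover (Chvatal, 1979). The sketch below is not the algorithm of Chapter 4; it assumes a plain, unweighted set cover instance, and the names `greedy_set_cover`, `universe`, and `subsets` are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the subset with the largest overlap with what is still uncovered.
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Toy instance (illustrative data only).
universe = set("abcdefg")
subsets = [set("abc"), set("cde"), set("efg"), set("abcd")]
cover = greedy_set_cover(universe, subsets)
print(cover)
```

Each iteration is polynomial in the input size, which is what makes this kind of greedy scheme attractive when an exact Integer Linear Program becomes too slow.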

7.1.2 Future directions

In the article, we tackled a relaxed version of the original problem, which was already hard. There are several approaches to such problems: approximation algorithms, fixed-parameter tractable algorithms, randomized algorithms, etc. Fixed-parameter algorithms identify a specific parameter of an intractable problem and fix it so that the problem can be solved efficiently when the parameter is small. However, since our problem was reducible to set cover and set cover is W[2]-hard (Downey and Fellows, 2012), our problem is probably not fixed-parameter tractable. A proof of such hardness can prevent us from wasting effort on trying to fix a parameter in our problem. On the other hand, randomized algorithms can be a promising approach. There are several well-known randomized algorithms, such as the Quicksort algorithm (Hoare, 1961) and Karger's algorithm (Karger, 1999). Interestingly, our cost function in Problem 3 is a non-monotonic modular function. This means that for any sets A and B that are subsets of our gene set S, f(S, A) + f(S, B) = f(S, A ∪ B) + f(S, A ∩ B). Hence, we can use either the non-adaptive randomized algorithm or the adaptive randomized algorithm to provide a constant approximation for our problem (Feige et al., 2011). In future work, we can compare those approaches with the algorithms presented in Chapter 2.
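The modularity identity can be checked numerically on a toy example. The sketch below assumes an additive cost with hypothetical per-gene weights (it is not the cost function of Problem 3) and fixes the gene set S, so the function takes a single subset argument. Mixed-sign weights make the function non-monotone.

```python
from itertools import combinations

# Hypothetical per-gene costs; positive and negative values make the function non-monotone.
weights = {"a": 2.0, "b": -1.0, "c": 0.5, "d": -3.0}


def cost(subset):
    """Additive (modular) cost of a subset of genes."""
    return sum(weights[g] for g in subset)


genes = sorted(weights)
all_subsets = [frozenset(c) for r in range(len(genes) + 1) for c in combinations(genes, r)]


def is_modular(f, subsets, tol=1e-9):
    """Check f(A) + f(B) == f(A | B) + f(A & B) for every pair of subsets."""
    return all(abs(f(A) + f(B) - f(A | B) - f(A & B)) <= tol
               for A in subsets for B in subsets)


print(is_modular(cost, all_subsets))                   # additive cost
print(is_modular(lambda A: len(A) ** 2, all_subsets))  # a non-modular counterexample
```

An exhaustive check like this is only feasible for very small gene sets, but it is a cheap sanity test when a cost function is modified, for example to include duplication costs.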

Since our relaxed version is NP-Hard, the original problem should also be hard to solve exactly. One interesting question is whether our greedy approach can provide any lower bound guarantee for the original problem. Another question is whether the cost function that incorporates duplication cost is still modular. If it is, devising appropriate randomized algorithms to solve the problem would be interesting research. Besides, we can also redefine our problem so that the objective function minimizes not only the evolutionary cost between the reference gene block and the query gene block but also the total evolutionary cost between every pair of orthologous gene blocks from our group of species.

Lastly, I propose to run our methods on simulated data. The current data set in our article is taken from the previous paper by Ream et al. (2015). Although our greedy algorithm is about 100 times faster than the previous method, it only takes the previous approach about 10^-5 seconds to run. Therefore, it would be interesting to devise simulation schemes that generate data sets showcasing scenarios where one of the methods excels and the other suffers. Once we have several different data sets, we can better evaluate the different approaches.
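One possible simulation scheme is sketched below, under strong simplifying assumptions: only deletions and adjacent duplications are modeled, events are applied uniformly at random, and split events are ignored. The function `simulate_block` and its parameters are hypothetical and are not part of the published pipeline.

```python
import random


def simulate_block(reference, n_events, rng, p_delete=0.5):
    """Apply up to n_events random deletions or adjacent duplications to a copy of the reference."""
    block = list(reference)
    log = []
    for _ in range(n_events):
        if not block:
            break  # nothing left to delete or duplicate
        i = rng.randrange(len(block))
        if rng.random() < p_delete:
            log.append(("deletion", block.pop(i)))
        else:
            block.insert(i, block[i])  # place a copy next to the original gene
            log.append(("duplication", block[i]))
    return block, log


rng = random.Random(0)  # fixed seed so the simulated data set is reproducible
reference = list("abcdefg")
dataset = [simulate_block(reference, n_events=k, rng=rng) for k in range(6)]
for block, log in dataset:
    print("".join(block), log)
```

Because the event log records the true number of deletions and duplications, each simulated block comes with a known answer against which the greedy and heuristic methods could be compared.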

7.2 Tracing the Ancestry of Operons in Bacteria

7.2.1 Discussions

• Problem. In Chapter 5, we presented the problem of reconstructing ancestral gene blocks in bacteria using the event-based method. Although ancestral reconstruction is a well-known problem, no research had been done on reconstructing gene blocks in bacteria.

• Approach. In our work, we presented maximum parsimony approaches to reconstruct the ancestral states of gene blocks. Under the parsimony assumption, we developed two methods: a local method and a global method. The local method optimizes the evolutionary changes between each node and its direct descendants, while the global method optimizes the total sum of evolutionary changes over the phylogenetic tree.

• Results. Using the two methods, we reconstructed the ancestral gene blocks of two groups of species: Gram-negative and Gram-positive bacteria. For Gram-negative bacteria, we used E. coli as our reference taxon and reconstructed the ancestral states of 55 E. coli operons. For Gram-positive bacteria, we used B. subtilis as our reference taxon and reconstructed the ancestral states of 87 B. subtilis operons. From the reconstructions, we observed several interesting trends of gene blocks emerging together. We also noticed a couple of core gene clusters that dated back to the most common ancestors, as well as intermediate functional forms of orthoblocks. We found that the formation of a protein complex was the main force for gene block conservation. Regarding our two methods, we found that the global approach gave lower cumulative evolutionary changes. We also proved that our two methods guarantee the minimum deletion costs and duplication costs in their objective functions. ROAGUE was the first software to interrogate the evolution of gene blocks in bacteria using the event-based model. The tool provides a fast way to trace the ancestral gene block and a clear visualization of the process. We hope ROAGUE will encourage more researchers to study the evolution of gene blocks in bacteria.

7.2.2 Future directions

Unlike the small parsimony problem (Fitch, 1971), our local ancestral reconstruction problem is quite hard. However, we were not able to establish its hardness through a formal reduction from a known NP-Hard problem. Hence, it would be very beneficial to provide NP-Hardness proofs for both the local and global approaches. Currently, our approaches only guarantee minimum deletion costs and minimum duplication costs separately. An NP-Hardness reduction could direct us to a potential approximation or randomization scheme. Besides, our local approach employs a correction mechanism that prevents a gene from being greedily propagated into the tree more than it should be. A study of how this affects the lower bound of the total deletion costs could strengthen the case for the local approach, as it was at least twice as fast as the global approach.
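For contrast, the small parsimony problem is easy for a single binary character such as the presence or absence of one gene. The sketch below shows only the bottom-up pass of Fitch's algorithm on a rooted binary tree; it is not the ROAGUE reconstruction, and the nested-tuple tree encoding and leaf names are assumptions made for illustration.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass; returns (candidate state set at the root, minimum number of changes).

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no change is needed at this node.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Presence (1) or absence (0) of a single gene in five hypothetical taxa.
tree = (("A", "B"), ("C", ("D", "E")))
present = {"A": 1, "B": 1, "C": 0, "D": 1, "E": 0}
root_set, changes = fitch(tree, present)
print(root_set, changes)
```

The gene block setting differs because events such as duplications act on whole blocks rather than on independent characters, which is one reason the problem studied here appears to be harder.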

ROAGUE did not take horizontal gene transfer (HGT) into account and was agnostic to gene order. HGT and gene order are vital forces of gene block evolution in some species (Fang et al., 2008; Lawrence and Roth, 1996). Therefore, it would be a great advantage to incorporate them into our models. Regarding HGT, there are two main approaches: the parametric approach and the phylogenetic approach. Parametric approaches compute the GC content in a sliding window and compare it across a specific range of the genome sequence (Nguyen et al., 2015). Fragments that differ profoundly from the genomic signature are flagged as potential horizontal gene transfer regions. Phylogenetic approaches, on the other hand, identify inconsistencies between the gene trees and the species tree. By cutting an edge (pruning) and regrafting it onto another edge, phylogenetic approaches reconcile the species tree with the gene trees and infer possible HGT nodes (Hallett and Lagergren, 2001).
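The parametric idea can be sketched in a few lines. The example below is a minimal illustration, not the cited method: the window size, step, and deviation threshold are arbitrary assumptions, and the genome is a synthetic string.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def flag_atypical_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, gc) for windows whose GC fraction deviates from the genome-wide value."""
    genome_gc = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - genome_gc) > threshold:
            flagged.append((start, gc))
    return flagged


# Synthetic genome: a 50% GC background with an AT-only insert in the middle.
genome = "ACGT" * 1000 + "AT" * 500 + "ACGT" * 1000
hits = flag_atypical_windows(genome)
print(hits)
```

In practice the threshold would need to be calibrated per genome, since native regions also vary in GC content.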

The second approach is more appropriate to incorporate into ROAGUE. For example, given a node u and its direct descendants u1 and u2, we might consider the scenario in which the presence of gene g in u1 or u2 is the result of an HGT event, instead of assuming that node u should have gene g. This can be done by adding an HGT event to our event-based model. Regarding gene order, one possible approach is to consider rearrangement events in a gene block. We can quantify the rearrangement cost of two blocks by finding the smallest number of steps needed to rearrange one block so that it has the same gene order as the other block. In my view, incorporating HGT and gene order will provide a more general perspective on the evolution of gene blocks in bacteria.
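One way to make the rearrangement cost concrete is to take a "step" to be a swap of two adjacent genes, in which case the minimum number of steps equals the number of inversions between the two orders. This step model is an assumption chosen for illustration (other models, such as reversals or transpositions, give different costs), and the function name is hypothetical.

```python
def adjacent_swap_distance(block, target):
    """Minimum number of adjacent swaps that turn block into target.

    Both blocks must contain the same genes, each exactly once.
    """
    if sorted(block) != sorted(target) or len(set(block)) != len(block):
        raise ValueError("blocks must contain the same genes, each exactly once")
    rank = {gene: i for i, gene in enumerate(target)}
    order = [rank[g] for g in block]
    # The number of inversions equals the minimum number of adjacent swaps.
    return sum(1
               for i in range(len(order))
               for j in range(i + 1, len(order))
               if order[i] > order[j])


print(adjacent_swap_distance("acb", "abc"))
print(adjacent_swap_distance("cba", "abc"))
```

Blocks with duplicated or missing genes would first need to be reconciled through the existing deletion and duplication events before such an order-based cost applies.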

Last but not least, I propose to use a maximum likelihood (ML) model to reconstruct the ancestral gene blocks in bacteria. In order to use the ML model, it is necessary to compute the likelihood that one gene block evolves into another. Unlike nucleotide or amino acid substitution models, where all possible cases are known, we cannot generate all possible gene blocks from the gene set in the reference operon or gene block. Therefore, we could use our three evolutionary events as the state transitions. Furthermore, I suggest using the normalized distance matrix (Ream et al., 2015) to compute the transition probability between each pair of states. I believe this model could provide a better understanding of the evolution of gene blocks because it takes branch lengths into account and can provide a likelihood estimate for our reconstruction.

7.3 Unraveling a Tangled Skein: Evolutionary Analysis of the Bacterial Gibberellin Biosynthesis Operon

7.3.1 Discussions

• Problem. In Chapter 6, we studied the evolution of the bacterial gibberellin biosynthetic operon. Although gibberellin phytohormones have been studied for more than a century, we know only a little about gibberellin biosynthesis in bacteria.

• Approach. We extended our ROAGUE method to study the evolution of the bacterial gibberellin operon. Our software implemented a pipeline that retrieves all contigs from the NCBI database for a given list of species names and annotates them using Prokka (Seemann, 2014). In addition, ROAGUE enabled the detection of cyp115 pseudogenes and added an ad hoc procedure to identify the fusion of gene fdGA into gene sdrGA or gene cyp114. Combined with the experimental study of the genes ids2 and ggps2, we employed ROAGUE to reconstruct the ancestral gene blocks of the GA operon within alphaproteobacteria rhizobia.

• Results. Using ROAGUE, we were able to characterize the loss, gain, and HGT of GA operon genes within alphaproteobacteria rhizobia. In addition to the species tree, we built an operon tree using the concatenated protein sequences of the core GA operon in order to capture HGT events. The software shed light on how the full-length gene cyp115 was lost and then regained independently in three different lineages, which further supports a potential HGT event. The full-length gene cyp115 is essential for GA biosynthesis in bacteria because it catalyzes the final step in producing bioactive GA4 (Nagel and Peters, 2017; Nagel et al., 2017; Nett et al., 2017a). The analyses using GC content and ROAGUE suggested that HGT happened as the GA operon evolved in plant-associated bacteria. In addition, the intermediate forms of gibberellin (resulting from the lack of gene cyp115) indicated that the production of GA might not be advantageous to all the proteobacteria rhizobia. These results showcase the power of ROAGUE in analyzing the evolution of gene blocks in bacteria.

7.3.2 Future directions

In the paper, we built the operon tree using a naive method that concatenated the alignments of the protein sequences of the core GA operon. In the future, we can apply Duplication-Transfer-Loss (DTL) reconciliation to build on the operon tree. There are several implementations of this model that can be readily applied (Bansal et al., 2012; Kordi and Bansal, 2017). This can also help with the HGT analysis of the GA operon.

Regarding fusion detection, our current method only provides an ad hoc mechanism. One potential approach is to use GeneFuse to detect gene fusions before finding orthologous gene blocks (Chen et al., 2018). GeneFuse is a practical method for detecting gene fusions with high sensitivity. The software also provides an interactive visualization of the results that reports the quality of the predictions, the coordinates of the fusion genes, etc. However, integrating this software into our pipeline might be a challenge. Alternatively, I propose to detect gene fusions while finding orthologous genes. This can be done by sorting the orthologous candidate genes by their start codon positions and resolving overlapping genes. This approach can be readily employed in our pipeline without using external software.
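The sort-and-resolve idea could look roughly like the sketch below. It assumes that each orthology hit records the reference gene and the coordinates of the matched target gene, and that two different reference genes hitting overlapping target intervals indicate a fusion candidate. The `Hit` record, the function name, and the coordinates are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    ref_gene: str  # gene in the reference block
    start: int     # start coordinate of the matched target gene
    end: int       # end coordinate (exclusive) of the matched target gene


def fusion_candidates(hits):
    """Sort hits by start and report pairs of different reference genes with overlapping targets."""
    ordered = sorted(hits, key=lambda h: (h.start, h.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # hits are sorted by start, so no later hit can overlap a
            if a.ref_gene != b.ref_gene:
                pairs.append((a.ref_gene, b.ref_gene))
    return pairs


# Illustrative coordinates only: cyp114 and fd both map to the same target gene.
hits = [
    Hit("cyp114", 100, 1400),
    Hit("fd", 100, 1400),
    Hit("sdr", 1450, 2200),
    Hit("cyp117", 2250, 3500),
]
print(fusion_candidates(hits))
```

A real implementation would also need a minimum-overlap criterion so that genes with short natural overlaps are not reported as fusions.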

Finally, I suggest applying ROAGUE to study the evolution of the GA operon in fungi and plants. Although our model was initially established to study gene block evolution in bacteria, it would be beneficial to utilize it for other species. One of the challenges would be how to quantify evolutionary changes in gene blocks in fungi and plants. If we could find appropriate models for those two groups of organisms, it would be possible to understand how GA operons evolve from a broader perspective.

BIBLIOGRAPHY

Acharya, R. (2009). Overexpression, purification, and characterization of MmgD from Bacillus subtilis strain 168. PhD thesis, University of North Carolina at Greensboro.

Ballouz, S., Francis, A. R., Lan, R., and Tanaka, M. M. (2010). Conditions for the evolution of gene clusters in bacterial genomes. PLoS Computational Biology, 6(2).

Bansal, M. S., Alm, E. J., and Kellis, M. (2012). Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics, 28(12):i283–i291.

Beckwith, J. R., Pardee, A. B., Austrian, R., and Jacob, F. (1962). Coordination of the synthesis of the enzymes in the pyrimidine pathway of e. coli. Journal of molecular biology, 5(6):618–634.

Belfaiza, J., Parsot, C., Martel, A., De La Tour, C. B., Margarita, D., Cohen, G. N., and Saint-Girons, I. (1986). Evolution in biosynthetic pathways: two enzymes catalyzing consecutive steps in methionine biosynthesis originate from a common ancestor and possess a similar regulatory region. Proceedings of the National Academy of Sciences, 83(4):867–871.

Chen, S., Liu, M., Huang, T., Liao, W., Xu, M., and Gu, J. (2018). Genefuse: detection and visualization of target gene fusions from dna sequencing data. International journal of biological sciences, 14(8):843.

Chernomor, O., Minh, B. Q., Forest, F., Klaere, S., Ingram, T., Henzinger, M., and von Haeseler, A. (2015). Split diversity in constrained conservation prioritization using integer linear programming. Methods in Ecology and Evolution, 6(1):83–91.

Chvatal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235.

Demerec, M. and Hartman, P. E. (1959). Complex loci in microorganisms. Annual Reviews in Microbiology, 13(1):377–406.

Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J. (2004). Genomic islands in pathogenic and environmental microorganisms. Nature Reviews Microbiology, 2(5):414–424.

Downey, R. G. and Fellows, M. R. (2012). Parameterized complexity. Springer Science & Business Media.

Eisen, J. A. (1998). A phylogenomic study of the muts family of proteins. Nucleic Acids Research, 26(18):4291–4300.

Enault, F., Suhre, K., Abergel, C., Poirot, O., and Claverie, J.-M. (2003). Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics, 19(suppl 1):i105–i107.

Fang, G., Rocha, E. P., and Danchin, A. (2008). Persistence drives gene clustering in bacterial genomes. BMC Genomics, 9(1):4.

Fani, R., Brilli, M., and Liò, P. (2005). The origin and evolution of operons: the piecewise building of the proteobacterial histidine operon. Journal of Molecular Evolution, 60(3):378–390.

Farris, J. S. (1977). Phylogenetic analysis under dollo’s law. Systematic Biology, 26(1):77–88.

Feige, U., Mirrokni, V. S., and Vondrák, J. (2011). Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153.

Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4):406–416.

Freiberg, C., Fellay, R., Bairoch, A., Broughton, W. J., Rosenthal, A., and Perret, X. (1997). Molecular basis of symbiosis between rhizobium and legumes. Nature, 387(6631):394–401.

González, V., Bustos, P., Ramírez-Romero, M. A., Medrano-Soto, A., Salgado, H., Hernández-González, I., Hernández-Celis, J. C., Quintero, V., Moreno-Hagelsieb, G., Girard, L., et al. (2003). The mosaic structure of the symbiotic plasmid of Rhizobium etli CFN42 and its relation to other symbiotic genome compartments. Genome Biology, 4(6):R36.

Göttfert, M., Röthlisberger, S., Kündig, C., Beck, C., Marty, R., and Hennecke, H. (2001). Potential symbiosis-specific genes uncovered by sequencing a 410-kilobase dna region of the bradyrhizobium japonicum chromosome. Journal of Bacteriology, 183(4):1405–1412.

Grishin, A. M., Ajamian, E., Tao, L., Zhang, L., Menard, R., and Cygler, M. (2011). Structural and functional studies of the escherichia coli phenylacetyl-coa monooxygenase complex. Journal of Biological Chemistry, 286(12):10735–10743.

Hallett, M. T. and Lagergren, J. (2001). Efficient algorithms for lateral gene transfer problems. In Proceedings of The fifth annual international conference on Computational biology, pages 149–156.

Harms, M. J. and Thornton, J. W. (2010). Analyzing protein structure and function using ancestral gene reconstruction. Current Opinion in Structural Biology, 20(3):360–366.

Hayashi, S., Gresshoff, P. M., and Ferguson, B. J. (2014). Mechanistic action of gibberellins in legume nodulation. Journal of Integrative Plant Biology, 56(10):971–978.

Hedden, P. and Sponsel, V. (2015). A century of gibberellin research. Journal of Plant Growth Regulation, 34(4):740–760.

Hershey, D. M., Lu, X., Zi, J., and Peters, R. J. (2014). Functional conservation of the capacity for ent-kaurene biosynthesis and an associated operon in certain rhizobia. Journal of Bacteriology, 196(1):100–106.

Hiroe, A., Tsuge, K., Nomura, C. T., Itaya, M., and Tsuge, T. (2012). Rearrangement of gene order in the phacab operon leads to effective production of ultrahigh-molecular-weight poly[(r)-3-hydroxybutyrate] in genetically engineered escherichia coli. Applied and Environmental Microbiology, 78(9):3177–3184.

Hoare, C. A. R. (1961). Algorithm 64: quicksort. Communications of the ACM, 4(7):321.

Homuth, G., Masuda, S., Mogk, A., Kobayashi, Y., and Schumann, W. (1997). The dnak operon of bacillus subtilis is heptacistronic. Journal of Bacteriology, 179(4):1153–1164.

Horowitz, N. H. (1945). On the evolution of biochemical syntheses. Proceedings of the National Academy of Sciences of the United States of America, 31(6):153.

Hou, X., Lee, L. Y. C., Xia, K., Yan, Y., and Yu, H. (2010). Dellas modulate jasmonate signaling via competitive binding to jazs. Developmental Cell, 19(6):884–894.

Ismail, W., El-Said Mohamed, M., Wanner, B. L., Datsenko, K. A., Eisenreich, W., Rohdich, F., Bacher, A., and Fuchs, G. (2003). Functional genomics by nmr spectroscopy. European Journal of Biochemistry, 270(14):3047–3054.

Jacob, F. and Monod, J. (1961). On the regulation of gene activity. In Cold Spring Harbor Symposia on Quantitative Biology, volume 26, pages 193–211, Cold Spring Harbor, NY. Cold Spring Harbor Laboratory Press.

Kaneko, T., Nakamura, Y., Sato, S., Minamisawa, K., Uchiumi, T., Sasamoto, S., Watanabe, A., Idesawa, K., Iriguchi, M., Kawashima, K., et al. (2002). Complete genomic sequence of nitrogen-fixing symbiotic bacterium bradyrhizobium japonicum usda110. DNA Research, 9(6):189–197.

Karger, D. R. (1999). Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383–413.

Kasimoglu, E., Park, S.-J., Malek, J., Tseng, C. P., and Gunsalus, R. P. (1996). Transcriptional regulation of the proton-translocating atpase (atpibefhagdc) operon of escherichia coli: control by cell growth rate. Journal of Bacteriology, 178(19):5563–5567.

Koonin, E. V. (2009). Evolution of genome architecture. The International Journal of Biochemistry & Cell Biology, 41(2):298–306.

Koonin, E. V., Makarova, K. S., and Aravind, L. (2001). Horizontal gene transfer in prokaryotes: quantification and classification. Annual Review of Microbiology, 55(1):709–742.

Kordi, M. and Bansal, M. S. (2017). Exact algorithms for duplication-transfer-loss reconciliation with non-binary gene trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Koumandou, V. L. and Kossida, S. (2014). Evolution of the f0f1 atp synthase complex in light of the patchy distribution of different bioenergetic pathways across prokaryotes. PLoS Comput Biol, 10(9):e1003821.

Lawrence, J. G. and Roth, J. R. (1996). Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics, 143(4):1843–1860.

Levy, A., Gonzalez, I. S., Mittelviefhaus, M., Clingenpeel, S., Paredes, S. H., Miao, J., Wang, K., Devescovi, G., Stillman, K., Monteiro, F., et al. (2018). Genomic features of bacterial adaptation to plants. Nature Genetics, 50(1):138–150.

Lim, H. N., Lee, Y., and Hussein, R. (2011). Fundamental relationship between operon organization and gene expression. Proceedings of the National Academy of Sciences, 108(26):10626–10631.

Lu, X., Hershey, D. M., Wang, L., Bogdanove, A. J., and Peters, R. J. (2015). An ent-kaurene-derived diterpenoid virulence factor from xanthomonas oryzae pv. oryzicola. New Phytologist, 206(1):295–302.

Luc, N., Risler, J.-L., Bergeron, A., and Raffinot, M. (2003). Gene teams: a new formalization of gene clusters for comparative genomics. Computational Biology and Chemistry, 27(1):59–67.

Luengo, J. M., Garcia, J. L., and Olivera, E. R. (2001). The phenylacetyl-coa catabolon: a complex catabolic unit with broad biotechnological applications. Molecular Microbiology, 39(6):1434–1442.

Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O., and Eisenberg, D. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753.

Martin, F. J. and McInerney, J. O. (2009). Recurring cluster and operon assembly for phenylacetate degradation genes. BMC Evolutionary Biology, 9(1):36.

McAdam, E. L., Reid, J. B., and Foo, E. (2018). Gibberellins promote nodule organogenesis but inhibit the infection stages of nodulation. Journal of experimental botany, 69(8):2117–2130.

Méndez, C., Baginsky, C., Hedden, P., Gong, F., Carú, M., and Rojas, M. C. (2014). Gibberellin oxidase activities in bradyrhizobium japonicum bacteroids. Phytochemistry, 98:101–109.

Nagel, R., Bieber, J. E., Schmidt-Dannert, M. G., Nett, R. S., and Peters, R. J. (2018). A third class: Functional gibberellin biosynthetic operon in beta-proteobacteria. Frontiers in Microbiology, 9:2916.

Nagel, R. and Peters, R. J. (2017). Investigating the phylogenetic range of gibberellin biosynthesis in bacteria. Molecular Plant-Microbe Interactions, 30(4):343–349.

Nagel, R., Turrini, P. C., Nett, R. S., Leach, J. E., Verdier, V., Van Sluys, M.-A., and Peters, R. J. (2017). An operon for production of bioactive gibberellin a4 phytohormone with wide distribution in the bacterial rice leaf streak pathogen xanthomonas oryzae pv. oryzicola. New Phytologist, 214(3):1260–1266.

Navarro, L., Bari, R., Achard, P., Lisón, P., Nemri, A., Harberd, N. P., and Jones, J. D. (2008). Dellas control plant immune responses by modulating the balance of jasmonic acid and salicylic acid signaling. Current Biology, 18(9):650–655.

Nett, R. S., Contreras, T., and Peters, R. J. (2017a). Characterization of cyp115 as a gibberellin 3-oxidase indicates that certain rhizobia can produce bioactive gibberellin a4. ACS Chemical Biology, 12(4):912–917.

Nett, R. S., Montanares, M., Marcassa, A., Lu, X., Nagel, R., Charles, T. C., Hedden, P., Rojas, M. C., and Peters, R. J. (2017b). Elucidation of gibberellin biosynthesis in bacteria reveals convergent evolution. Nature Chemical Biology, 13(1):69.

Nguyen, H. N., Jain, A., Eulenstein, O., and Friedberg, I. (2019). Tracing the ancestry of operons in bacteria. Bioinformatics, 35(17):2998–3004.

Nguyen, M., Ekstrom, A., Li, X., and Yin, Y. (2015). Hgt-finder: A new tool for horizontal gene transfer finding and application to aspergillus genomes. Toxins, 7(10):4035–4053.

NJ, G. (1984). Construction and characterization of an escherichia coli strain with a unci mutation. Journal of Bacteriology, 158:820–825.

Nogales, J., Macchi, R., Franchi, F., Barzaghi, D., Fernandez, C., Garcia, J. L., Bertoni, G., and Diaz, E. (2007). Characterization of the last step of the aerobic phenylacetic acid degradation pathway. Microbiology, 153(2):357–365.

Ogata, H., Fujibuchi, W., Goto, S., and Kanehisa, M. (2000). A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Research, 28(20):4021–4028.

Oldroyd, G. E., Murray, J. D., Poole, P. S., and Downie, J. A. (2011). The rules of engagement in the legume-rhizobial symbiosis. Annual Review of Genetics, 45:119–144.

Omelchenko, M., Makarova, K., Wolf, Y., Rogozin, I., and Koonin, E. (2003). Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biology, 4(9):R55+.

Overbeek, R., Fonstein, M., D’souza, M., Pusch, G. D., and Maltsev, N. (1999). The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences of the United States of America, 96(6):2896–2901.

Pardee, A. B., Jacob, F., and Monod, J. (1959). The genetic control and cytoplasmic expression of “inducibility” in the synthesis of β-galactosidase by e. coli. Journal of Molecular Biology, 1(2):165–178.

Pauling, L., Zuckerkandl, E., and Henriksen, T. (1963). Chemical paleogenetics. Acta chem. scand, 17:S9–S16.

Pueppke, S. G. and Broughton, W. J. (1999). Rhizobium sp. strain ngr234 and r. fredii usda257 share exceptionally broad, nested host ranges. Molecular Plant-Microbe Interactions, 12(4):293–318.

Pupko, T., Pe'er, I., Shamir, R., and Graur, D. (2000). A fast algorithm for joint reconstruction of ancestral amino acid sequences. Molecular Biology and Evolution, 17(6):890–896.

Quattlebaum, A. L. (2009). Characterization of Biosynthetic and Catabolic Pathways of Bacillus Subtilis Strain 168. PhD thesis, University of North Carolina at Greensboro.

Ream, D. C., Bankapur, A. R., and Friedberg, I. (2015). An event-driven approach for studying gene block evolution in bacteria. Bioinformatics, 31(13):2075–2083.

Schulz, T., Stoye, J., and Doerr, D. (2018). Graphteams: a method for discovering spatial gene clusters in hi-c sequencing data. BMC Genomics, 19(5):308.

Schwechheimer, C. (2012). Gibberellin signaling in plants–the extended version. Frontiers in Plant Science, 2:107.

Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068–2069.

Senior, A. E. (1990). The proton-translocating atpase of escherichia coli. Annual Review of Biophysics and Biophysical Chemistry, 19(1):7–41. PMID: 2141983.

Srinivasan, B. S., Caberoy, N. B., Suen, G., Taylor, R. G., Shah, R., Tengra, F., Goldman, B. S., Garza, A. G., and Welch, R. D. (2005). Functional genome annotation through phylogenomic mapping. Nature Biotechnology, 23(6):691–698.

Stahl, F. W. and Murray, N. E. (1966). The evolution of gene clusters and genetic circularity in microorganisms. Genetics, 53(3):569.

Sullivan, J. T., Trzebiatowski, J. R., Cruickshank, R. W., Gouzy, J., Brown, S. D., Elliot, R. M., Fleetwood, D. J., McCallum, N. G., Rossbach, U., Stuart, G. S., et al. (2002). Comparative sequence analysis of the symbiosis island of mesorhizobium loti strain r7a. Journal of Bacteriology, 184(11):3086–3095.

Swofford, D. L. and Maddison, W. P. (1987). Reconstructing ancestral character states under wagner parsimony. Mathematical Biosciences, 87(2):199–229.

Syvanen, M. (1994). Horizontal gene transfer: evidence and possible consequences. Annual Review of Genetics, 28(1):237–261.

Tatsukami, Y. and Ueda, M. (2016). Rhizobial gibberellin negatively regulates host nodule number. Scientific Reports, 6:27998.

Teufel, R., Mascaraque, V., Ismail, W., Voss, M., Perera, J., Eisenreich, W., Haehnel, W., and Fuchs, G. (2010). Bacterial phenylalanine and phenylacetate catabolic pathway revealed. Proceedings of the National Academy of Sciences, 107(32):14390–14395.

Tully, R., van Berkum, P., Lovins, K., and Keister, D. (1998). Identification and sequencing of a cytochrome p450 gene cluster from bradyrhizobium japonicum. Biochimica et Biophysica Acta (BBA)-Gene Structure and Expression, 1398(3):243–255.

Uchiumi, T., Ohwada, T., Itakura, M., Mitsui, H., Nukui, N., Dawadi, P., Kaneko, T., Tabata, S., Yokoyama, T., Tejima, K., et al. (2004). Expression islands clustered on the symbiosis island of the mesorhizobium loti genome. Journal of Bacteriology, 186(8):2439–2448.

Ueguchi-Tanaka, M., Ashikari, M., Nakajima, M., Itoh, H., Katoh, E., Kobayashi, M., Chow, T.-y., Yue-ie, C. H., Kitano, H., Yamaguchi, I., et al. (2005). Gibberellin insensitive dwarf1 encodes a soluble receptor for gibberellin. Nature, 437(7059):693–698.

Voigt, B., Hoi, L. T., Jürgen, B., Albrecht, D., Ehrenreich, A., Veith, B., Evers, S., Maurer, K.-H., Hecker, M., and Schweder, T. (2007). The glucose and nitrogen starvation response of bacillus licheniformis. Proteomics, 7(3):413–423.

Wells, J. N., Bergendahl, L. T., and Marsh, J. A. (2016). Operon gene order is optimized for ordered protein complex assembly. Cell reports, 14(4):679–685.

Wetzstein, M., Völker, U., Dedio, J., Löbau, S., Zuber, U., Schiesswohl, M., Herget, C., Hecker, M., and Schumann, W. (1992). Cloning, sequencing, and molecular analysis of the dnak locus from bacillus subtilis. Journal of Bacteriology, 174(10):3300–3310.

Wiemann, P., Sieber, C. M., Von Bargen, K. W., Studt, L., Niehaus, E.-M., Espino, J. J., Huss, K., Michielse, C. B., Albermann, S., Wagner, D., et al. (2013). Deciphering the cryptic genome: genome-wide analyses of the rice pathogen fusarium fujikuroi reveal complex regulation of secondary metabolism and novel metabolites. PLoS Pathogens, 9(6).

Williams, P. D., Pollock, D. D., Blackburne, B. P., and Goldstein, R. A. (2006). Assessing the accuracy of ancestral protein reconstruction methods. PLoS Computational Biology, 2(6).

Wolf, Y. I., Rogozin, I. B., Kondrashov, A. S., and Koonin, E. V. (2001). Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Research, 11(3):356–372.

Yamaguchi, S. (2008). Gibberellin metabolism and its regulation. Annual Review of Plant Physiology, 59:225–251.

Yang, Z. (2007). Paml 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8):1586–1591.

Yang, Z., Kumar, S., and Nei, M. (1995). A new method of inference of ancestral nucleotide and amino acid sequences. Genetics, 141(4):1641–1650.

Zi, J., Mafu, S., and Peters, R. J. (2014). To gibberellins and beyond! surveying the evolution of (di)terpenoid metabolism. Annual Review of Plant Biology, 65:259–286.

Zuckerkandl, E. and Pauling, L. (1965). Evolutionary divergence and convergence in proteins. In Evolving genes and proteins, pages 97–166. Elsevier, New York, NY.