Imperial College London

London Institute of Medical Sciences

Patterns of Horizontal Gene Transfer into the Clade

Alexander Dmitriyevich Esin

September 2018

Submitted in part fulfilment of the requirements for the degree of

Doctor of Philosophy of Imperial College London

For my grandmother, Marina. Without you I would have never been on this path. Your unwavering strength, love, and fierce intellect inspired me from childhood and your memory will always be with me.

Declaration

I declare that the work presented in this submission has been undertaken by me, including all analyses performed. To the best of my knowledge it contains no material previously published or presented by others, nor material which has been accepted for any other degree of any university or other institute of higher learning, except where due acknowledgement is made in the text.

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Abstract

Horizontal gene transfer (HGT) is the major driver behind rapid bacterial adaptation to a host of diverse environments and conditions. Successful HGT is dependent on overcoming a number of barriers on transfer to a new host, one of which is adhering to the adaptive architecture of the recipient genome. My aim was to investigate how HGT gain is spatially patterned, both on arrival and in long-term maintenance. I chose to focus on HGT into a model group of , that includes Geobacillus spp., not only to avoid ambiguity associated with aggregate analyses, but also because observed biases could enhance our ability to engineer this emerging chassis. In this thesis I first present my methodology for detecting HGT into a specific group; I augmented existing approaches to improve computational tractability while deriving a stringent set of horizontally transferred (HT) gene predictions. In the second results chapter, I assess the predicted HGTs in the context of previous work to find that they are highly consistent, justifying my detection approach. Finally, I dissect the topology of HT genes across Geobacillus genomes to find three large zones of contiguous HGT enrichment. I find that this patterning is driven by gene function, with metabolic genes clustering towards the terminus. Interestingly, the HGT-rich origin-proximal zones, home to many HT genes involved in membrane biogenesis, overlap with the section of the chromosome trapped within the nascent endospore during sporulation. Similar functional enrichment patterns are found in other spore-forming , but not those unable to sporulate. This suggests that HGT flow into Geobacillus genomes may be spatially constrained, at least in part, by the sporulation program. In the final chapter, I discuss the implications of this research in a bioengineering context, and suggest possible future directions to confirm the link between HGT topology and sporulation.

Acknowledgements

First and foremost, I must thank my supervisor Toby. You provided me with this opportunity to study for a PhD with you, despite me wearing a suit to my interview. I could not have been more lucky with a supervisor and mentor. Thank you for your astute insights and your always timely reminders to stop focusing on the small things and look at the bigger picture. Your immeasurable patience with me throughout the four years has been nothing short of extraordinary.

The Molecular Systems group - thank you for all your support and days of cumulative patience hearing out my, often confusing, ramblings. Thank you Jelena for always being ready to help. Maria, Shivani, and Val - despite making me explode with kittens, you have been awesome and always there for me, thank you. Antoine and Jake, you have brought so much energy to the lab - keep the Costa del Cutter tradition strong and be the wizards you know you can be. Voracious Vas, thank you for always keeping an eye out for me, even if it led to many awkward stares - you are now in charge of the charm!

Thank you also to everyone in our neighbouring CRG group. From winning the rounders trophy as the “Boris’ Babes”, to my frequent and unashamed theft of your coffee, you’ve made the last four years so much better. Leonie, the figures in this thesis are dedicated to you. Without your mentorship on all things colourful this publication would not look half as good, thank you.

Tom Ellis and members of Ellis Lab, thank you for your many keen observations and suggestions, and for always making me feel like part of the group.

Hugh, Julian, Ketan - my friends and, most importantly, partners in tea and spades. It is an incredible privilege to know that you are always there, whether for a walk, a chat, or an eerily lifelike impersonation of a certain aquatic mammal!

The finest co-workers I could have asked for, Alex and James. It just would have never been the same without you. For the games, the drinks, the risotto, the all-nighters arguing over CAPS lock, the snakes, and the krabs. Life may take us sideways, but I will always find you in the oven.

My family. There is no warmer embrace to come back to than yours. Nicholas and Catherine, you have never failed to brighten my day, even when I did not know I needed it. Max, you have been an incredible and loving force driving me to succeed since I can remember; you have taught me so much about so many things, I could not be more proud to be your son. Mama, you, more than anyone, have made all of this possible. You have done so much for me that gratitude cannot be put into words. I love you all.

Alena, you have been my rock. Through my ups, and my downs, my distractions, and my laziness you have encouraged, supported, humoured, and pushed me. From the time I applied, to the end of my PhD, you have taken every step with me and I cannot imagine this journey without you. Stay magical, my little Zebra.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
List of Publications

1 Introduction
  1.1 Gene turnover in bacterial genomes
  1.2 HGT in evolution: preference to pattern
    1.2.1 Function
    1.2.2 Form follows function
    1.2.3 HGT: a matter of time
    1.2.4 Space: the final frontier
  1.3 Perspectives for
  1.4 The GPA clade
  1.5 Aims of this thesis

2 Clade-specific HGT inference
  2.1 Introduction
    2.1.1 Compositional and phylogenetic approaches
    2.1.2 Specific steps in phylogenetic analysis
    2.1.3 Chapter aims
  2.2 Methods
  2.3 Results
    2.3.1 Two genomic datasets
    2.3.2 Genes to proteins, proteins to homologues
    2.3.3 Homologues to gene families
    2.3.4 Gene families to gene trees
    2.3.5 Reference tree reconstruction and local correction
    2.3.6 HGT inference with Mowgli
    2.3.7 Refining HGT and vertical gene history predictions
    2.3.8 Determining the age of transfer
    2.3.9 RAxML vs FastTree
  2.4 Discussion

3 Assessment of HGT into GPA
  3.1 Introduction
    3.1.1 Function and nucleotide content as hallmarks
    3.1.2 Temporal patterns of HGT into GPA
    3.1.3 Sources of genes transferred into GPA
  3.2 Methods
  3.3 Results
    3.3.1 Function of HT and vertical genes
    3.3.2 Nucleotide content of HT and vertical genes
    3.3.3 Temporal dynamics of gene flow into GPA
    3.3.4 Sources of HGT into GPA
  3.4 Discussion

4 Topology of HT genes in GPA genomes
  4.1 Introduction
  4.2 Methods
  4.3 Results
    4.3.1 Gene history spatially patterns GPA genomes
    4.3.2 Inversions around origin-terminus axis drive symmetry
    4.3.3 HGT zones encompass genomic islands
    4.3.4 HGT zones are equally selective of incoming genes
    4.3.5 HGT zones are patterned by gene function
    4.3.6 The Near Origin zone and sporulation
  4.4 Discussion

5 Discussion
  5.1 Clade-specific HGT inference
  5.2 Assessment of HGT into GPA
  5.3 Topology of HT genes in GPA genomes
  5.4 Evolutionary lessons for synthetic biology
  5.5 Future directions
    5.5.1 HGT domain topology across prokaryotes
    5.5.2 Spore-specific genetic innovation in GPA clade

References

Appendix A Supplementary Figures

Appendix B Supplementary Tables

Appendix C Other Publications

List of Figures

1.1 The GPA clade in the Bacillaceae family

2.1 Phylogenetic context in HGT inference
2.2 Clustering of paralogues
2.3 Gene family size reduction
2.4 Reference tree refinement
2.5 GPA-to-GPA transfers and refining the recipient branch
2.6 GPA assignment to genomospecies groups
2.7 Comparison of RAxML and FastTree methods

3.1 HT genes patterned by function
3.2 GC content of vertical and HT genes
3.3 GC content of old and recent HT genes
3.4 Timing of HGT across GPA lineage evolution
3.5 Common HGT donors into GPA

4.1 Topology of HGT into GPA clade
4.2 HGT-rich zones and origin-terminus symmetry
4.3 HGT zones encompass genomic islands
4.4 Selection on strand and GC content across HGT zones
4.5 Spatio-functional patterns of gene flow into the GPA clade
4.6 Forespore development genes positionally conserved in GPA
4.7 HT gene topology may be driven by spore formation

S1 Topology of HGT into individual GPA genomes

List of Tables

2.1 Gene family and protein processing at different thresholds
2.2 Raw GPA gene history predictions by Mowgli
2.3 Filtering reconciliation results
2.4 Identifying old and recent HGT events

3.1 Annotating COGs at different transfer costs
3.2 COG assignment at one transfer cost (T4)

S1 Bacillaceae species growth temperatures

Abbreviations

AIMS architecture imparting sequence
AT adenine and thymine
CDS coding sequence
COG Cluster of Orthologous Groups
CV coefficient of variation
D duplication (in the context of reconciliation)
DNA deoxyribonucleic acid
FISH fluorescence in situ hybridization
GC guanine and cytosine
GI genomic islands
GPA members of the genera Geobacillus, Parageobacillus, Anoxybacillus
HGT horizontal gene transfer
HPC high performance computing
HT horizontally transferred
ILS incomplete lineage sorting
kb kilo base pairs
L loss (in the context of reconciliation)
lHGT long distance HGT
M million
Ma million years ago
MCL Markov Cluster Algorithm
ML maximum likelihood
MSA multiple sequence alignment
Myr million years
NCBI National Center for Biotechnology Information
protID non-redundant NCBI protein identifier
RBBH reciprocal best BLAST hit
RBS ribosome binding site
RNA ribonucleic acid
rRNA ribosomal RNA
sHGT short distance HGT
SPR subtree pruning and regrafting
T transfer (in the context of reconciliation)
taxid taxonomic identifier
tRNA transfer RNA

List of Publications

Publications arising from this thesis

Esin, A., Ellis, T., & Warnecke, T. (2018). Horizontal gene flow into Geobacillus is constrained by the chromosomal organization of growth and sporulation. BioRxiv. https://doi.org/10.1101/381442

Other publications

Esin, A., Bergendahl, L. T., Savolainen, V., Marsh, J. A., & Warnecke, T. (2018). The genetic basis and evolution of red blood cell sickling in deer. Nature Ecology and Evolution, 2(2), 367–376. https://doi.org/10.1038/s41559-017-0420-3

The above manuscript was the culmination of work undertaken by me, my supervisor Tobias Warnecke, and our collaborators during my graduate studies. Since this study was unrelated to my main research initiative, it is not presented in the main body of this thesis. Nevertheless, my contribution to the findings put forward in the manuscript represents many months of work. As such, I have included a short section outlining the key facets of the study, followed by the full manuscript, in Appendix C. The manuscript has been reproduced here in accordance with the publisher’s rules (https://www.nature.com/reprints/permission-requests.html).

Chapter 1

Introduction


In this thesis, I explore the dynamics of horizontal gene transfer (HGT) into a model group of in an effort to understand the guiding principles shaping the “what”, “when”, and “where” of gene gain. By focusing on a model group that is an emerging bioengineering chassis, I hope that my findings could have a direct impact on future synthetic biology work. My analysis begins with detection of HGT. To achieve this, I use a common set of methods that I have augmented for speed and accuracy - a feat possible because of my focus on a model group. I consider the predicted horizontally transferred (HT) genes in the context of previous works to determine whether the inference approach has truly identified genes derived from transfer events. Finally, I interrogate the extent to which HT genes occupy distinct spatial niches within the host genome, and what may drive the sequestration patterns that we see - questions that have not been systematically explored in a model group until now. Aside from understanding how HGT has shaped microbial evolution, interest in the field is also driven by genome engineering efforts - the ability to harness this natural phenomenon to augment genomes with new functions in the laboratory has had a big impact in both research and biotechnology. Increasing our understanding of the principles guiding successful HGT over the course of evolution will, certainly in the longer term, continue to improve the tractability of genome engineering in the field of synthetic biology.

1.1 Gene turnover in bacterial genomes

The continued, ever faster, accumulation of fully sequenced bacterial genomes has led to an intriguing observation: in a given genome, a subset of genes will be found conserved in closely related genomes (such as strains of a species, or species within a genus) while the remainder will be specific to that genome - not having been inherited vertically from the common ancestor (Koonin and Wolf 2012). Indeed, more than 15 years ago, an examination of three Escherichia coli strains already showed that fewer than 40% of their combined non-redundant protein-coding genes were shared by all three genomes (Welch et al. 2002; also see Hayashi 2001). Such observations have led to the concept of ‘core’ and ‘flexible’ (‘accessory’) genomes. The core genome includes the genes necessary for a group of organisms to inhabit some common niche, but it is the genes comprising the flexible genome that allow highly specialised adaptation to specific environmental conditions. It is this flexible genome that experiences the lion’s share of gene turnover; genes can be lost, and genes can be gained by HGT, though not necessarily in equal measure. Although some transient constituents of the flexible genome experience high rates of both gain and loss, such as genes associated with transposable elements, it is generally acknowledged that gene loss is the pervasive force guiding microbial genome evolution (Mira, Ochman, and Moran 2001). The ease with which microbes discard non-functional or unneeded DNA from their genomes by purifying selection is a key consideration in some approaches to HGT inference, e.g. when alternative evolutionary scenarios are compared (see Chapter 2).
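The core/flexible distinction can be illustrated with a minimal Python sketch. It assumes each genome has already been reduced to a set of gene family identifiers; the strain names and gene labels below are hypothetical toy data, not results from this thesis or from Welch et al. (2002). Families present in every genome are treated as core, and everything else in the combined gene pool as flexible.

```python
from typing import Dict, Set, Tuple


def core_and_flexible(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split gene families into core (present in all genomes) and flexible (the rest)."""
    if not genomes:
        return set(), set()
    gene_sets = list(genomes.values())
    pan_genome = set().union(*gene_sets)      # every family seen in any genome
    core = set.intersection(*gene_sets)       # families shared by all genomes
    return core, pan_genome - core


# Hypothetical toy data: three strains described by made-up gene family labels.
toy_genomes = {
    "strain_A": {"rpsL", "dnaA", "gyrB", "blaTEM", "lacZ"},
    "strain_B": {"rpsL", "dnaA", "gyrB", "tetA"},
    "strain_C": {"rpsL", "dnaA", "gyrB", "lacZ", "cas9"},
}

core, flexible = core_and_flexible(toy_genomes)
# Fraction of the combined, non-redundant gene pool that is shared by all strains.
core_fraction = len(core) / len(core | flexible)
print(sorted(core), sorted(flexible), round(core_fraction, 2))
```

In practice the hard part is the step assumed away here, namely deciding which genes in different genomes belong to the same family; the set arithmetic itself is trivial once that grouping exists.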

1.2 HGT in evolution: preference to pattern

HGT is a seemingly ubiquitous process amongst prokaryotes (Treangen and Rocha 2011) and a major source of genetic diversity, with rates of gene gain exceeding expansion of existing gene families by two orders of magnitude (Puigbò et al. 2014). On the back of many evolutionary analyses, horizontal transfer is now widely considered to be the major process underpinning prokaryotic gene innovation and microbial ability to colonise new niches (Gogarten and Townsend 2005; Hehemann et al. 2016; Nelson-Sathi et al. 2012). Despite this, the degree of dynamism with respect to what, when, and where a transfer occurs is still not fully clear. Ignoring the mechanisms controlling the expulsion of DNA from the donor, its transfer (possibly via a vector), and its uptake by the host cell (see Thomas and Nielsen 2005, for a review on mechanistic aspects of HGT), multiple factors affect the successful integration and retention of the acquired DNA. Some characteristics of the incoming genes are more suitable for HGT, and when enough HGT-derived genes are analysed, preferences turn into observable patterns - for example, HT genes tend to be richer in adenine and thymine (AT) nucleotides than the host genome (see below). Over many works, a plethora of patterns have been observed that seem to be characteristic of HT genes (see Popa and Dagan 2011): these concern not only the gene itself (e.g. its nucleotide content, codon usage, product features), but also the source of the gene, the timing of gene gain over lineage evolution, and where in the host genome the gene finds a home. Below, I discuss the known preferences for (and obstacles to) HGT, the consequent patterns, and how some of these can be leveraged to help distinguish HT genes. I start with what is arguably the most important factor determining the gain and loss of a gene - the function of its protein product.

1.2.1 Function

The clearest and most persistent selective force acting on a gene concerns its utility: for longer-term retention, the function of the gene product should confer an adaptive advantage to the organism in its environment, even if only occasionally. Even genes or elements that are typically considered ‘selfish’, such as the highly mobile transposases, may be useful to the host by facilitating adaptive genome rearrangement, regulating gene expression, and promoting transfer of other functional genes (Casacuberta and González 2013; Ellis and Haniford 2016; Frost et al. 2005). Thus, changes in ecology or lifestyle can result in turnover of no longer useful genes and the acquisition of new genes to fit new needs.


Connectivity - not function

Returning to the concepts of the core and flexible genomes introduced above, the first characterisation of this subdivision came from the observation that prokaryotic genomes can be seen as coarsely divided into two different groups of genes distinguished by their function (Rivera et al. 1998). Older diverging genes, broadly referred to as ‘informational’, encode products involved in essential functions such as translation, transcription, and replication and are typically found in the core genome. The aversion of these gene types to HGT was described as part of a framework termed ‘The Complexity Hypothesis’ (Jain, Rivera, and Lake 1999). The authors suggested that informational genes are less likely to be transferred as they are often members of large, complex, and co-evolved systems: turnover of these with HGT would be akin to removing a gear from a perfectly functional clock mechanism and replacing it with a slightly mis-sized gear - the clock would work less well or not at all. Later work looking at protein complexes clarified the above observation, suggesting that it is in fact the connectivity of the product (the number of protein-protein connections), rather than its specific function, that influences transfer propensity (Cohen, Gophna, and Pupko 2011). Although this is a trend rather than a rule - transfer of even the most highly conserved and networked genes can occur (Gogarten, Doolittle, and Lawrence 2002; Lind et al. 2010) - the data suggest that genes encoding products with fewer protein-protein connections have a reduced barrier to transfer. For example, genes that confer resistance to antibiotics, such as the beta-lactamase family of enzymes, are capable of providing immediate selective advantage by solitary expression and have a long history of HGT (Barlow and Hall 2002). Connectivity, therefore, has emerged as a strong predictor of HGT. However, in lieu of calculating protein connectivity for each genome studied, can function still be used as a proxy?


Patterns in function

A common and simple way of stratifying gene function in microbes is to use so-called Clusters of Orthologous Groups (COGs) (Tatusov et al. 2003). The broad pooling of functions into ~20 functional classes results in some, e.g. translation (COG J), that are almost entirely made up of highly-connected informational genes and are highly resistant to transfer (Sorek et al. 2007). Others, such as carbohydrate metabolism (COG G), are dominated by genes encoding poorly connected products. Thus, differential enrichment in COG functions can be seen in both the HGT-derived flexible genome (metabolic and defence (COG V) functions), and the vertically-inherited core genome (translation (COG J), cell motility (COG N), and cell cycle control (COG D)). Interestingly, this functional bias can be observed in core and flexible genomes calculated for both recently diverged taxa (e.g. strains in Davids and Zhang 2008) and much more ancient, phylogenetically broad groups (e.g. phyla in Deschamps et al. 2014).

In summary, genomic adaptation by HGT is largely restricted to the flexible genome and favours the exchange of genes whose products are stand-alone and not dependent on existing, complex protein networks. Critically, this pattern is observed throughout prokaryotic evolution. This connectivity-driven bias is reflected in the functional annotation of HT genes, and can provide a litmus test of putative HT and non-HT gene sets (see Chapter 3). Of course, while biological function is the key determinant in long-term selection for or against a gene within a microbial genome, and genes that can work independently are preferred in this regard, some HT genes never make it that far and are discarded at, or shortly after, integration.

1.2.2 Form follows function

Several studies have observed that HGTs most frequently occur between closely related organisms (Puigbò, Wolf, and Koonin 2010; Skippington and Ragan 2012; Williams, Gogarten, and Papke 2012). The preference to acquire genes from closely related taxa seems straightforward: genes from phylogenetically close genomes will share similar compositional qualities (e.g. nucleotide content, promoter recognition, amino acid pools) that will facilitate their smooth integration into the recipient genome - to plug-and-play. Transfers with highly divergent compositions from the recipient genome may struggle to integrate, be selected against by defence systems, and may be expressed poorly. Nevertheless, gene acquisition from distantly related genomes has been documented (Mongodin et al. 2005; Nelson et al. 1999; Popa et al. 2011), demonstrating that these barriers can be overcome, especially if the gene in question provides a significant adaptive benefit. The importance of these barriers, then, is dependent on the frequency of transfers between distantly related taxa - are such events common in microbial evolution?

Shared ecology drives gene exchange

Analysis of 144 prokaryotic genomes revealed the presence of so-called “highways” of gene sharing between not only closely related taxa but also more distantly related organisms that shared the same environment (Beiko, Harlow, and Ragan 2005). Smillie et al. (2011) further dissected the role of ecology in shaping HGT in the context of the human microbiome. By deconvoluting ecology, geography, and genome similarity the authors revealed it is primarily shared ecology that structures the gene exchange network, with negligible contributions from physical and phylogenetic distances between donor and recipient. Similar observations outside of the human microbiome include prolific intergeneric gene exchange in an Antarctic lake (DeMaere et al. 2013), spread of the ability to reduce oxygen among hyperthermophilic anaerobes (Le Fourn et al. 2011), and extensive gene gain during Staphylococcus aureus colonisation in vivo (McCarthy et al. 2014). Thus, organisms sharing the same ecological niche are clearly a valuable source of genetic diversity for potential hosts seeking to adapt to, and thrive in, the same environment. If a significant proportion of transferred genes might not conform to the recipient’s preferred composition (since they are coming from ecologically similar but phylogenetically divergent donors), what are the tolerable limits to such dissimilarity and how are these discrepancies ironed out after integration?

Limitations at the level of DNA

Similarity in donor-recipient nucleotide content has been linked to the increased rates of HGT between closely related taxa. Popa et al. (2011) observed that in 86% of donor-recipient relationships the difference in their genomic guanine and cytosine (GC) nucleotide contents was less than 5%. Interestingly, however, multiple studies have shown that genes acquired by HGT have a lower GC content than the host genome (Lawrence and Ochman 1997; Médigue et al. 1991; Syvanen 1994). This trend holds true seemingly regardless of the AT-richness of the recipient, and this observation was shown to not be an artefact of employed detection methods (Daubin, Lerat, and Perrière 2003). Daubin and co-authors hypothesised that the AT-richness of acquired genes may be due to either the GC-preference of host-expressed restriction enzymes or the AT-preference of HGT vectors, such as insertion sequences and phages. The exact mode by which genomes select for AT-rich HGTs, and how that interplays with existing intragenomic compositional biases (in themselves not fully understood; Hershberg and Petrov 2010; Hildebrand, Meyer, and Eyre-Walker 2010; Raghavan, Kelkar, and Ochman 2012; Rocha, Touchon, and Feil 2006), remains unclear.

The picture is further complicated when considering codon usage bias. HT genes fall into a separate category of codon usage (Médigue et al. 1991; Moszer, Rocha, and Danchin 1999), and surprisingly do not appear to carry codon-level signatures of their previous genomic hosts (Daubin, Lerat, and Perrière 2003). While some work suggests that mismatched codon bias between HT genes and the host genome might indeed be a barrier to transfer (Medrano-Soto et al. 2004; Tuller et al. 2011), others have suggested that this ‘barrier’ is at best a minor hurdle (Amorós-Moya et al. 2010; Kudla et al. 2009). Taken together, these observations indicate a bias in the composition of acquired genes, though the causes of this are incompletely understood.
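The compositional comparisons discussed above rest on a simple quantity, the GC fraction of a sequence. The following Python sketch shows one way it could be computed and compared between a gene and its host genome. The function names, the toy sequences, and the use of a 0.05 default threshold (mirroring the 5% difference quoted from Popa et al. 2011) are illustrative assumptions, not the procedure used in any of the cited studies.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def gc_deviation(gene: str, genome: str) -> float:
    """Gene GC fraction minus genome GC fraction; negative means the gene is AT-rich."""
    return gc_content(gene) - gc_content(genome)


def similar_gc(donor: str, recipient: str, threshold: float = 0.05) -> bool:
    """True if two sequences differ in GC fraction by less than the threshold (assumed 5%)."""
    return abs(gc_content(donor) - gc_content(recipient)) < threshold


# Hypothetical toy sequences, for illustration only.
toy_genome = "ATGC" * 25          # balanced composition
toy_gene = "ATGAAATTTGCA"         # AT-rich relative to the toy genome

print(gc_content(toy_genome), gc_content(toy_gene))
print(gc_deviation(toy_gene, toy_genome))
```

A negative deviation corresponds to the AT-rich signature described for HT genes. Real analyses would typically work at specific codon positions and over many genes, which this sketch does not attempt.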

Limitations in expression

Even if a gene matches the required nucleotide and codon requirements, successfully integrates into the chromosome and avoids silencing or eviction, it still needs to be expressed - even if only on-demand. Given the general adjacency of gene and promoter, cotransfer of the two must be common, and expression in the host is more likely if the promoter nucleotide content is similar to that of the host’s native expression system. However, arrival and subsequent high-level expression can have a negative effect (Sorek et al. 2007), which is perhaps why HT genes have been found to be expressed at lower than average levels (Davids and Zhang 2008). The latter observation may also be linked to the fact that HGT is associated with predominantly operational (metabolic) genes. Many genes are acquired as extensions of existing metabolic circuitry (Davids and Zhang 2008; Dorman 2009) and so are presumably expressed in an ad hoc manner. As with the above observations, aberrant expression can be a barrier to HGT, although negative effects can presumably be outweighed by fitness gains conferred by the new function. In such cases, compositional and expression irregularities of HT genes are expected to erode over time as a result of compensatory evolution.

Reducing dissimilarity between HT gene and host

The diminution of differences in the compositional signatures of the host genome and acquired DNA is summarised by the concept of amelioration, introduced by Lawrence and Ochman (1997). The authors described the change in the nucleotide content of HGT-derived genes to match the signature of the hosts (enteric bacteria) over evolutionary timescales. This change need not be passively driven by only the intrinsic mutational biases acting on the host genome, but can also be driven by selection. For example, Hao and Golding (2006) observed that recently transferred genes had elevated Ka/Ks ratios in 13 Bacillaceae genomes (also previously noted in Daubin and Ochman 2004). A high nonsynonymous/synonymous ratio may signify directional selection, the adaptation of these genes to a new local environment in their host, but this is difficult to distinguish from relaxed evolution of genes no longer required by the host and earmarked for eventual loss (Hao and Golding 2006). Finally, amelioration of gene expression has also been observed. It was found that newly transferred genes have higher levels of gene duplication (Hooper and Berg 2003), probably driven by a short-term need for more gene product while under the control of a suboptimal promoter (Lind et al. 2010). Over millions of years, acquired genes integrate more fully into host regulatory networks (Lercher and Pál 2008), adjusting expression to fit host needs without having to resort to gene duplication.

In summary, HGT arrivals into the genome face a number of obstacles - highlighted by the particular patterns observed in successfully integrated DNA. One example is the remarkable consistency of HT gene AT-richness, regardless of host GC content, which is driven by an as yet unclear mechanistic or evolutionary force. What is certain is that deviation from optimal signatures can be tolerated, and compensated for in the short and longer terms by amelioration. The strong signal present in new arrivals, disappearing over evolutionary timescales, forms the backbone of some HGT detection methods (see Chapter 2, Section 2.1.1) and can serve as a qualitative check to increase confidence in HT gene detection achieved by orthogonal approaches (see Chapter 3, Section 3.1.1).

1.2.3 HGT: a matter of time

Given the role of HGT in the adaptation of an organism to novel environmental conditions that are likely to change over time, what are the implications for the timing of gene gain over lineage evolution? Some prior studies of microbial genome evolution suggested that gene loss operates in a continuous, clock-like manner over evolutionary time (Puigbò et al. 2014; Snel, Bork, and Huynen 2002), though there is also evidence of “bursts” of gene loss, e.g. the rapid genome reduction in obligately symbiotic bacteria (McCutcheon and Moran 2012). In genomes not undergoing extreme genome reduction, loss must be to some extent offset by gene gain via HGT. A biphasic model of prokaryotic gene turnover was proposed - short periods of massive gene gain followed by pervasive gene loss over longer timescales to once again streamline the genome (summarised in Wolf and Koonin 2013). This bursting model of HGT has notable examples in evolution, e.g. the transfer of thousands of genes from the bacterial endosymbionts on their way to becoming mitochondria and chloroplasts in the origin of eukaryotes, and plants and algae, respectively (see Timmis et al. 2004). In a similar vein, large influxes of genes were suggested to be the source of multiple major archaeal clades, allowing their colonisation of novel niches (Nelson-Sathi et al. 2012, 2015). The authors used an ad hoc approach of timing gene gain that showed a burst of HGT at the roots of the groups, with the rest of the lineage dominated by gene loss (Nelson-Sathi et al. 2015). Interestingly, their finding was quickly challenged and it was suggested that the data were consistent with a gradual accrual of HGTs over the evolution of archaeal lineages when more established methods are applied (Groussin et al. 2016). It is clear that determining the timing of gene gain, even accepting the unlikely assumption that it follows the same pattern across all microbes, is largely a question of accurate inference of HGT events in the context of alternative evolutionary scenarios. I address relevant aspects of HGT inference at length in Chapter 2. However, regardless of whether genes arrive piecemeal or in a flash flood they all share a need for somewhere to go - a place in the host genome.


1.2.4 Space: the final frontier

Importance of bacterial genome organisation

Bacterial genome organisation can be highly structured and gene location is often integral to both the function of the gene and its impact on host fitness. So, it is unsurprising that particular chromosomal locations may be more or less welcoming to newcomers. The origin of replication is a highly conserved region amongst bacteria, necessary for the initiation of DNA replication, and is the only essential cis-acting region of the E. coli chromosome (Kato and Hashimoto 2007). Bacteria typically have a single origin (see Mott and Berger 2007) and insertions in the region are likely to disrupt replication initiation, resulting in severe, if not fatal, consequences. Similarly, the high conservation of certain operons, e.g. those encoding ribosomal proteins (Itoh et al. 1999), suggests that these streamlined regulatory units are resistant to transferred genes. So, comparison of genomes can reveal regions that are refractory to HGT at a local level.

Genome architecture is also shaped by more global forces. For example, constitutively highly expressed genes are often clustered around the origin of replication. There, genes primarily involved in translation and transcription can take advantage of functional polyploidy in quickly replicating genomes and benefit from increased gene dosage simply due to their chromosomal location (Couturier and Rocha 2006). Consistent with the above observation, moving highly expressed ribosomal proteins to an origin-distal location in Vibrio cholerae resulted in reduced growth rate and diminished capacity for invasion of its host, Drosophila melanogaster (Soler-Bistué et al. 2015). This phenotype was rescued by introducing multiple origin-distal copies of the rearranged genes, confirming the importance of transiently elevated gene dosage during replication. This preference for having genes involved in translation and growth near the origin is underlined by the observation that genomes with greater enrichment for translation (COG J) genes near the origin also have the largest bias for genomic inversions that are symmetric around the origin-terminus axis (Repar and Warnecke 2017). Such symmetric rearrangements do not affect the distance of the gene to the origin, and so would maintain the expected dosage. Thus, depending on the function of the acquired gene, insertion at a suboptimal distance from the origin may result in either an insufficient or an excessive amount of product.

Further, genes show a similarly pronounced bias for being on the leading strand of replication (Rocha 2008). The basis for this preference is thought to be the potential for head-on collision between the replicating DNA polymerase and the transcribing RNA polymerase in genes positioned on the lagging strand (Mirkin and Mirkin 2005; Srivatsan et al. 2010). Interestingly, however, it appears that gene essentiality rather than expression dictates the likelihood of a gene being on the leading strand (Rocha and Danchin 2003a) - a pattern observed across multiple bacterial phyla (Rocha and Danchin 2003b). Experimentally, when either native or introduced reporter cassettes are oriented against the prevailing direction of transcription, they tend to be poorly expressed or, perhaps more critically, interfere with expression of neighbouring genes (Yeung et al. 2017). These knock-on effects on surrounding genes, at ranges in the kilobases, are at least partially caused by altered DNA topology resulting from transcription-induced supercoiling (Bryant et al. 2014; Ferrándiz et al. 2014).

Aside from replication and gene expression, genome architecture can be driven by other factors.
For example, the distribution of so-called architecture imparting sequences (AIMS) along a particular strand of the bacterial chromosome allows proteins such as FtsK to locate the terminus region and direct faithful chromatid segregation to daughter cells (Hendrickson and Lawrence 2006). The authors further suggest that maintenance of AIMS density and position across genomes of related bacteria, regardless of their genic context, signifies pervasive positive selection on their location. This was later shown to be the case, as at least 18% of genes acquired in the terminus-proximal region were lost due to improper orientation of their AIMS (Hendrickson et al. 2018). Hendrickson et al. (2018) go on to suggest, however, that this barrier is surmountable if the gene in question provides sufficient functional benefit.

Another bacterial process that is highly dependent on particular genome architecture is sporulation. When under severe stress, some microbes can produce highly resistant spores that can survive otherwise non-viable environments and germinate once conditions improve. In B. subtilis, commitment to the sporulation program involves chromosomal duplication, tethering of one chromosome copy to the cell membrane at each pole of the cell by the RacA protein, and, finally, asymmetric division with the formation of a septum one sixth of the cell length from one pole (Barák and Muchová 2018; Hilbert and Piggot 2004). The septation physically traps one third of one of the chromosomes within the newly formed forespore compartment. The remainder of the trapped chromosome is eventually actively translocated into the forespore (Frandsen et al. 1999), but for a critical period of forespore development gene expression and dosage are asymmetric between the forespore and mother cell compartments. This is highlighted by the presence of a gene cluster located entirely within the first third of the B. subtilis chromosome that is under the control of a single sigma factor - σF (Wang et al. 2006b). This regulator is only expressed during the early stages of the sporulation program (Losick and Stragier 1992). By comparison, genes under the control of the next regulator in the cascade, σG, which becomes active after DNA translocation into the forespore, do not show positional constraint (Wang et al. 2006b).

Thus, it is not hard to see how integration of new genes onto the main chromosome can disrupt the established, adaptive genome architecture.
New genes must be careful to not disrupt local structures, such as the origin or essential operons, integrate at a suitable distance from the origin, in a suitable orientation, and without affecting architecture-dependent processes such as segregation or sporulation.
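The replication-dependent dosage effect described above can be made concrete with a short calculation. The sketch below is not taken from the studies cited; it assumes the standard steady-state (Cooper-Helmstetter-type) relation for exponentially growing cells, in which the copy number of a locus relative to the terminus is 2^(C(1 - m)/τ), where m is the relative distance from the origin (0 at the origin, 1 at the terminus), C is the time taken to replicate the chromosome, and τ is the doubling time. The function name and the parameter values are illustrative assumptions only.

```python
def relative_dosage(m: float, c_period_min: float, doubling_time_min: float) -> float:
    """Expected copy number of a locus relative to a terminus-proximal locus.

    Assumes steady-state exponential growth (Cooper-Helmstetter-type model).
    m                 -- relative distance from the origin (0 = origin, 1 = terminus)
    c_period_min      -- time to replicate the chromosome, in minutes (assumed value)
    doubling_time_min -- cell doubling time, in minutes (assumed value)
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie between 0 (origin) and 1 (terminus)")
    return 2.0 ** (c_period_min * (1.0 - m) / doubling_time_min)


# Illustrative fast-growth scenario: replication takes twice the doubling time,
# so several replication rounds overlap.
for m in (0.0, 0.5, 1.0):
    print(m, relative_dosage(m, c_period_min=40.0, doubling_time_min=20.0))
```

Under these assumed values an origin-proximal gene is present at four times the copy number of a terminus-proximal gene, whereas in slow growth (doubling time much longer than the replication time) the ratio approaches one. This is why the dosage argument applies mainly to quickly replicating genomes.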


HT genes are patterned by location

The abundant spatial obstacles likely explain why HGT-derived genes are often relegated to secondary chromosomes or (Cooper et al. 2010; Rocha 2008). Where genes do integrate on the main chromosome, they are typically found in clusters (Dilthey and Lercher 2015; Oliveira et al. 2017; Touchon et al. 2009). Partially, this reflects co-transfer of genes that are functionally co-dependent, i.e. would be less (maybe not at all) useful if transferred in isolation. For example, analysis of recent HGT events amongst 21 γ-proteobacteria showed significant co-clustering of functionally related gene pairs (Dilthey and Lercher 2015), echoing the selfish operon model (Lawrence and Roth 1996). HT gene proximity is also accounted for by the presence of HGT-permissive zones, or hotspots of gene turnover, along the chromosome (Oliveira et al. 2017). Such hotspots can form and be reinforced through specific integration biases by the action of nearby integrases or sites that promote recombination (e.g. dif sites in E. coli) (Touchon, Bobay, and Rocha 2014). Larger hotspots, between 10 and 200 kilobase pairs (kb), that possess particularly distinct compositional signatures and are often associated with boundary elements (such as transposases, transfer RNAs (tRNAs), repeat sequences), have been defined as genomic islands (GIs) (Juhas et al. 2009). GIs are often further subclassified according to their gene content, e.g. pathogenicity, metabolic, resistance (Dobrindt et al. 2004).

On a more global topological scale, HT genes have been found to accumulate towards the terminus in multiple species that include E. coli and B. subtilis (Lawrence and Ochman 1998; Moszer, Rocha, and Danchin 1999; Rocha 2004; Touchon, Bobay, and Rocha 2014; Zarei, Sclavi, and Cosentino Lagomarsino 2013). This pattern might be driven by the need to avoid excessive expression of the HT genes due to the replication-dependent dosage effect described above.
Similarly, housekeeping genes in Streptomyces species are found at the centre of the linear chromosomes whereas transposable elements are preferentially positioned closer to the telomeres (Choulet et al. 2006). Recently, an analysis of HGT hotspots across a diverse set of bacterial genomes reiterated the above observation (Oliveira et al. 2017). However, after stratifying the identified hotspots, the authors observed that only those hotspots including prophages linearly increased in frequency with increasing distance from the origin of replication. Interestingly, the majority of hotspots identified in the study - those associated with neither prophages nor integrative conjugative/mobilizable elements - showed peaks of enrichment and depletion along the chromosome that were not obviously linked to replication.

In summary, HGT-derived genes cluster along the bacterial chromosome, both due to co-transfer and due to the existence of more permissive sites along the genome that facilitate integration. However, the distribution of these hotspots of gene turnover along the chromosome cannot be entirely (or even mostly) explained by replication - a dominant force in adaptive genome organisation. I wondered whether topological biases may be obscured by aggregate analysis (such as that performed in Oliveira et al. 2017). The rationale for what genes go where might become clear if a particular group with a more restricted lifestyle (and so adapted genome architecture) were to be analysed. Finally, any spatial patterns identified within a particular group might provide pointers for rule-based genome engineering.
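To illustrate what asking whether HT genes accumulate towards the terminus involves in practice, the following is a minimal sketch rather than the method of any study cited above. It assumes a circular chromosome whose terminus lies diametrically opposite the origin, which is only an approximation for real genomes. The function names, the gene coordinates, and the HT labels are hypothetical.

```python
def origin_distance(pos: int, ori: int, genome_len: int) -> float:
    """Relative distance of a position from the origin on a circular chromosome.

    Returns 0.0 at the origin and 1.0 at the point diametrically opposite it
    (taken here, as an assumption, to be the terminus).
    """
    d = abs(pos - ori) % genome_len
    d = min(d, genome_len - d)          # shortest way around the circle
    return d / (genome_len / 2)


def ht_fraction_by_bin(genes, ori, genome_len, n_bins=4):
    """Fraction of genes labelled as HT in each origin-to-terminus bin.

    genes -- iterable of (position, is_ht) pairs; both replichores are pooled.
    Bins with no genes are reported as None.
    """
    totals = [0] * n_bins
    ht = [0] * n_bins
    for pos, is_ht in genes:
        m = origin_distance(pos, ori, genome_len)
        b = min(int(m * n_bins), n_bins - 1)   # keep m == 1.0 in the last bin
        totals[b] += 1
        ht[b] += bool(is_ht)
    return [h / t if t else None for h, t in zip(ht, totals)]


# Hypothetical toy genome of 1000 bp with the origin at position 100.
toy_genes = [(100, False), (150, False), (900, True),
             (350, False), (600, True), (620, True)]
print(ht_fraction_by_bin(toy_genes, ori=100, genome_len=1000, n_bins=2))
```

Pooling both replichores into a single origin-to-terminus axis, as done here, is exactly the kind of aggregation that can hide zone-specific enrichment; keeping the two replichores separate, or using finer bins, would be a natural variation.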

1.3 Perspectives for synthetic biology

DNA acquisition is subject to strong counter-selection over time if the gene or product causes any burden on the host cell, underlined by the millions of years it takes for genes to fully integrate into host regulatory circuits (Lercher and Pál 2008). Burden can be imposed by the arriving gene, both in evolutionary and synthetic contexts, by not exactly adhering to the adapted dynamics and architecture of the host - the barriers to HGT discussed above. This then could pose an obstacle to engineering stable regulatory networks, or even establishing heterologous expression of a single gene, in genomes of interest.

Here it can be useful to consider the similarity between biological cells and computing machines. In both there are two distinct modules that drive overall activity: the hardware represents the physical components (e.g. the cellular components, the DNA) and the software represents the programme governing the interaction of physical components that results in function (Danchin 2012; Lorenzo and Danchin 2008). In a computer, one can imagine that the operating system can, to a large extent, be hardware agnostic - if the hardware is destroyed, the operating system can still function with a similar set of components. To what extent then can the software be separated from the hardware in a biological cell?

There have been some prior attempts to separate the software from some parts of the hardware. Lartigue and colleagues (2007) successfully transplanted the whole genome of one Mycoplasma species into the genome-free cell of another. Despite this success, later work showed that this detachment of hardware and software had limits: there was a negative correlation between increasing phylogenetic distance and genome transplantation success (Labroussaa et al. 2016). On a smaller scale, partial redesign of the T7 bacteriophage genome to be more logically ordered (in human terms!) resulted in a viable organism, but one that produced smaller lysis plaques than the wild-type (Chan, Kosuri, and Endy 2005). Thus, understanding the programme connecting the underlying physical components appears to be insufficient to replicate - never mind enhance - fitness in the wild for even the simplest biological systems. These examples demonstrate that there is a clear benefit to be gained from a deeper understanding of the forces governing both the software and hardware of genome organisation.
Investigating the drivers of HGT gain and host adaptation during evolution can provide pointers for rule-based engineering - natural HGT has to overcome the same barriers as synthetically integrated genes. Some patterns derived from evolutionary analysis are already considered for synthetic constructs, e.g. in promoter choice, gene nucleotide content, and codon usage bias. However, adaptive genome architecture is rarely taken into account. Partially, this is due to the frequent use of plasmids in model microbes, yet these can be turned over rapidly and require constant selection for maintenance (Friehs 2004). Chromosomal integration confers greater stability and may, in many instances, be preferable, but requires the development of toolkits that facilitate site-specific integration into the host. There has been a drive to develop such methods in model organisms such as E. coli (Gu et al. 2015; St-Pierre et al. 2013; Wei et al. 2010). As these technologies progress and become applied to a broader range of systems, the importance of considering genome architecture in selecting suitable integration sites will increase (Danchin 2012).

Beyond avoiding fitness pitfalls by integrating genes into sites evolutionarily primed for gene turnover, design that takes into account spatial patterns could also augment bioengineering attempts. For example, if there is a need to express a heterologous gene at a high level in a quickly replicating genome, one could take advantage of replication-dependent dosage by integrating the gene in the permissive zone closest to the origin of replication. But can one avoid disrupting conserved local architecture?

In a broader context, understanding the genomic landscape is a key requirement for the tractability of organisms serving as chassis for synthetic biology. The choice of chassis has been dominated by E. coli and B. subtilis amongst prokaryotes, largely due to a comparatively long history of laboratory use. However, more recently, there have been increasing calls in the field to develop more varied and versatile chassis (Adams 2016; Kim et al. 2016).
Such demand provides an opportunity to combine the investigation of spatial patterns of HGT in a specific genomic context - interesting from an evolutionary perspective - with the more engineering-oriented characterisation of the forces shaping genome evolution in a novel chassis.

One of the groups highlighted as having significant chassis potential is the thermophilic genus Geobacillus. Its ability to grow and produce enzymes at high temperatures augments its catabolic versatility in industrial use (Adams 2016; Cripps et al. 2009). In addition, its relatively close relation to B. subtilis partially offsets the scarcity of Geobacillus-specific tools available for genetic modification. Focusing on a single, closely related group that shares a distinct lifestyle makes it possible to track adaptation shaped by environment over evolutionary time without sacrificing the number of detected HGT events. Finally, analysis of gene flow into a single clade, rather than a more traditional “all versus all” approach, can have the advantage of significant computational savings (see Chapter 2).

1.4 The GPA clade

Geobacillus are aerobic and facultatively anaerobic, endospore-forming, Gram-positive thermophiles belonging to the non-monophyletic Bacillaceae family nested in the phylum Firmicutes (Aliyu et al. 2016; Zhang and Lu 2015). The genus was originally established to differentiate the phenotypically and phylogenetically similar group of thermophilic bacilli (Bacillus group 5) that had a high degree of coherence in their 16S ribosomal RNA (rRNA) sequences (Nazina et al. 2001). The genus has undergone several taxonomic revisions. Relevant to this work were two successive reassignments of Geobacillus subclades to the new genera Anoxybacillus (Coorevits et al. 2012) and Parageobacillus (Aliyu et al. 2016). Despite the shifting nomenclature, Geobacillus, Parageobacillus, and Anoxybacillus (henceforth GPA) cluster together into a single monophyletic clade within the Bacillaceae (Figure 1.1).

Members of the GPA clade have been isolated all over the globe, from both land and marine environments (Zeigler 2014). GPA has been sampled thousands of metres above sea level in the Andes (Marchant et al. 2002), at the bottom of the Mariana Trench (Takami et al. 2004), and in oil wells and gold mines (Rastogi et al. 2009; Wang et al. 2006a). Most reliably, GPA members can be isolated from hot springs (Pinzón-Martínez et al. 2010), geothermal soils (Meintanis et al. 2006), and hot composts (Takaku et al. 2006), and they have frequently been found as contaminants of processed food and dairy products, as spores can survive heat treatment (Seale et al. 2012; Sevenier et al. 2012). The observation that GPA spores are consistently identified in environments that never reach the minimal temperature required for growth has led to a suggestion that the spores have adapted for long-distance atmospheric transport (Zeigler 2014).

Figure 1.1: The GPA clade in the Bacillaceae family. The 25 GPA genomes used in this work are nested within the broader Bacillaceae family, surrounded by the polyphyletic Bacillus genus. Outside the GPA, other thermophiles are also found in the generally mesophilic Bacillaceae. The phylogeny was reconstructed from a concatenated alignment of 1-to-1 orthologues shared by all genomes, identified at an E-value threshold of 1E-50 (see Chapter 2 Section 2.3.5 for details). The optimal growth temperatures, identified for each tip of the phylogeny, were collated from several sources (see Table S1).

The GPA clade, with optimal growth temperatures varying between 45 and 70 ℃ (Nazina et al. 2001, see Table S1), is one of a handful of thermophilic Bacillaceae groups (McMullan et al. 2004). Although it has been suggested that the Bacillaceae ancestor may have been thermophilic (Hobbs et al. 2012), the prevalence of mesophilic taxa (Figure 1.1) and the relatively few thermophilic groups may instead indicate that higher temperature tolerance was acquired at the roots of those groups. Horizontal transfer of genes that aid thermophily has been described previously (Boucher et al. 2003), though the signatures are more distinct for transfers into and between hyperthermophiles.

While it may be difficult to implicate HGT as the source of thermophily in GPA, there is other evidence that gene transfer has played a role in the evolution of this lineage. Comparative analyses of the core and flexible genomes indicate that extensive HGT may have shaped GPA genomic diversity (Bezuidt et al. 2016; Studholme 2015). Bezuidt and co-authors specifically implicate gene flow from phylogenetically distant taxa into GPA, a conclusion fitting the distinct lifestyle that characterises GPA. At a finer scale, several studies have identified candidate HGTs into GPA. In a comparative study involving Geobacillus kaustophilus and five Bacillus species, Takami et al. (2004) identified 839 genes with no orthologues in the bacilli.
Of these, the authors highlighted genes encoding a putative spermine synthase and tRNA/rRNA methyltransferases as possibly having a role in Geobacillus thermoadaptation. Clusters of genes involved in arabinan utilisation, belonging to a GI-like region of the Geobacillus stearothermophilus str. T-6 genome, have also been identified (Shulami et al. 2011). These genes, along with systems for breaking down xylan and galactan, impart the capacity to process hemicellulose. Finally, a number of alkane hydroxylase (alkB) genes, necessary for petroleum biodegradation, were identified in several GPA genomes and were likely gained through horizontal transfer (Tourova et al. 2008). However, while HGT is considered to be a significant contributor to GPA evolution, and anecdotal evidence suggests some genetic influx, explicit patterns of gene flow have not yet been characterised on a genome-wide scale for this clade.

The attractiveness of GPA as a model to study gene flow is further increased by the clade being quite closely related to B. subtilis. A recent conservative estimate of divergence places the most recent common ancestor of Geobacillus and B. subtilis at 799 million years ago (Ma) (Marin et al. 2017), and this separation may be close enough for some comparative analysis of genome architecture. A suggestion that both Bacillus and GPA share some aspects of adaptive genome organisation can be gleaned from the observation that large-scale recombination events in the Bacillaceae, including GPA, predominantly occur along the origin-terminus axis (Repar and Warnecke 2017). This indicates that the genomes in these species might be under strong topological constraint, resulting in a recalcitrance to asymmetric rearrangement. This facet of genome evolution is also useful in pattern detection, as any patterns are unlikely to be obscured by historical asymmetrical rearrangements.

1.5 Aims of this thesis

The main objective of this thesis is to characterise gene flow into GPA, a model bacterial clade that is gaining popularity as a chassis in synthetic biology, with the aim of identifying patterns of HT gene distribution along the genome. Biased distribution of HGT-derived genes would indicate the presence of topological constraints imposed by adaptation and may in turn provide useful pointers for engineering the chromosome in the laboratory.

Analysis of HT gene dynamics in GPA requires the robust detection of genes that have been acquired via HGT. Thus, the first specific aim of this thesis is to accurately identify a set of genes across suitable GPA genomes that have been gained via horizontal transfer. In Chapter 2 I apply a phylogenetic approach to conservatively identify GPA genes derived from HGT events and genes inherited vertically from the GPA common ancestor. By employing a reconciliation-based method to infer putative HGT in the context of alternative evolutionary events, I modulate the stringency of detection and produce several sets of HT gene predictions. Throughout the detection pipeline, I develop ad hoc methods to significantly reduce the computational costs of HGT inference by leveraging the need to only detect transfers into the GPA clade.

In Chapter 3, armed with several sets of HT genes detected at increasing stringencies, I qualitatively evaluate them in the context of expected patterns (regarding gene functions, nucleotide composition, etc.) typically associated with HGT. The aim of this step is to increase confidence that predictions at a particular stringency represent true HGT events. Alongside analysing the nucleotide content and functional enrichment of HT and vertically inherited genes, I also characterise the timing of gene gain into GPA and the major sources of these acquired genes.

Confident in the quality of the HT gene set, I then characterise the distribution of these genes along the GPA chromosomes in Chapter 4. I investigate the driving forces behind the biased topological distribution observed for HT genes, considering both the timing of gene gain and the function of the gene products. Finally, I suggest and interrogate a likely link between the spatial constraints imposed on HT genes and spore-formation in GPA.

Chapter 2

Clade-specific HGT inference


2.1 Introduction

In this chapter I describe my approach for conservative detection of HGT events into the GPA clade. In the introduction, I first consider the two orthogonal approaches frequently used to detect HGTs in the field and follow with an analysis of key steps in a typical phylogeny-based pipeline, highlighting common sources of error. The results sections follow my HGT inference pipeline, specifically focusing on steps where I have augmented or supplemented existing methods in an attempt to increase overall accuracy and/or speed. As this is a methods-oriented chapter, the methods section is largely reserved for specific program parameters.

2.1.1 Compositional and phylogenetic approaches

Two approaches are frequently considered to detect exogenous DNA sequences in prokaryotic genomes: the composition-based approach and the phylogeny-based approach (Ravenhall et al. 2015). Both have been extensively and successfully used to detect HGT, yet both also suffer from distinct weaknesses. In this section I consider the methodology underpinning both approaches and highlight the strengths and weaknesses of each; finally, I discuss the facets of the phylogeny-based approach that make it the more powerful choice for the purposes of this work.

Compositional methods

Composition-based approaches leverage the fact that transferred DNA can share some characteristics (“genomic signatures”, Karlin and Burge 1995) that might be different to the recipient genome. For example, bacterial genomes are known to have heterogeneous GC contents that range from ~13-75% (McCutcheon and Moran 2010; Thomas et al. 2008). Higher or lower genomic GC content might be driven by ecological selection pressures (Musto et al. 2004) or mutational bias (Hildebrand, Meyer, and Eyre-Walker 2010), but these explanations have been challenged (see Lassalle et al. 2015; Wang, Susko, and Roger 2006, respectively). Other genomic signatures can also be highly distinct between bacterial genomes and are often intrinsically linked to one another. For example, GC content impacts codon usage bias (Muto and Osawa 1987). Nevertheless, these characters appear to be stable at the genome level over evolutionary timescales and are moulded by mutational processes and selection acting across the majority of the genome.

By scanning over a genome, regions with diverging genomic signatures can be identified, and confidence in the extrinsic origin of such stretches increases with greater divergence. Although individual signatures may suffer from poor resolution in certain circumstances (such as a transfer from a donor that shares a similar mean GC content), the power of detection can be augmented by simultaneously evaluating multiple metrics. For example, Karlin (2001) applied a combination of five methods (GC content anomalies; dinucleotide bias; codon usage bias; amino acid usage bias; inclusion of ‘putatively alien’ genes) in detecting GIs. By their nature, compositional methods are exceptionally suited to detecting recently acquired HT genes that have had limited time for amelioration.

Compositional methods are not typically computationally demanding, and for the most part simply require the genome of interest for analysis. As such, they make scaling HGT detection to hundreds or thousands of genomes very tractable. Their second advantage lies in the inherent ability to detect contiguous stretches of alien DNA, including non-coding regions. Compositional approaches struggle, however, in detecting transfers from both closely related genomes, due to shared genomic signatures, and from distantly related but compositionally similar genomes.
However, their most profound limitation is the inability to detect more ancient HGTs that have fully or partially ameliorated (see Chapter 1, Section 1.2.2). Lawrence and Roth (1996) predicted amelioration timescales for genes in the spa cluster of Shigella and Salmonella of around 300 million years (Myr), yet the observed rates were at least ten-fold higher. Clearly, the time taken for a gene to ameliorate will depend on the compositions of both the host and transferred gene, and may be further complicated by diversifying selection on the acquired sequence, changes in host ecology that skew genome signatures to a new optimum, and more complex gene histories involving the repeated gain and loss of the same gene. Certainly, when investigating over 100 Myr of lineage evolution, most HGT events would be hidden from detection by compositional methods.
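The scanning idea behind compositional detection can be sketched in a few lines. The example below is a deliberately simplified illustration that uses a single signature (GC content) and a z-score cut-off; it is not the procedure of Karlin (2001) or of any tool used in this work. The window size, step, cut-off, and toy sequence are arbitrary assumptions.

```python
from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def atypical_windows(genome: str, window: int = 1000, step: int = 500,
                     z_cutoff: float = 2.0):
    """Return (start, end, gc) for windows whose GC content deviates from the
    mean of all windows by more than z_cutoff standard deviations."""
    starts = range(0, len(genome) - window + 1, step)
    values = [gc_content(genome[s:s + window]) for s in starts]
    if not values:
        return []
    mu, sd = mean(values), pstdev(values)
    if sd == 0:                      # perfectly homogeneous sequence
        return []
    return [(s, s + window, gc) for s, gc in zip(starts, values)
            if abs(gc - mu) / sd > z_cutoff]


# Toy "genome": a 50% GC background with a 1 kb AT-only insert in the middle.
toy = "ACGT" * 2500 + "AAAT" * 250 + "ACGT" * 2500
print(atypical_windows(toy, window=1000, step=1000))
```

The sketch also makes the limitations discussed above tangible: an insert whose GC content matches the background, whether because the donor is compositionally similar or because the sequence has ameliorated, produces no deviating window and goes undetected.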

Phylogenetic methods

Phylogeny-based methods, on the other hand, employ phylogenetic incongruence to detect HGT. At its most basic, an evolutionary relationship can be inferred by comparing a gene to its orthologues in other genomes, and paralogues within the same genome, at the sequence level (nucleotide or amino acid). Sequences sharing high similarity may be considered to be more closely evolutionarily related than those with greater divergence, and the relationships can be explicitly modelled in a branching pattern called a “gene tree”. The gene tree is then compared to a so-called “reference” or “species” tree - this second phylogeny is expected to represent the vertical history of the organisms to which the genes in the gene tree belong. If a gene tree perfectly captures the evolutionary history of the gene and the reference tree perfectly represents the vertical history of the organisms, any incongruence (conflict) in branching order (topology) between the two must be the result of HGT, some series of gene duplications and losses, or other biological processes (see Kamneva and Ward 2014). Of course, in practice this is much harder because gene and reference trees are reconstructed with imperfect information, limiting the certainty in the resulting topologies.

Three key steps are a staple of phylogeny-based HGT detection methods: identification of homologues of a particular gene (or genes) in multiple genomes, tree reconstruction based on an alignment of the sequences, and detection of incongruence with a species tree. Limited breadth of taxa from which to draw orthologous relationships, poor alignment due to highly heterogeneous sequences, and poor choice of phylogeny-building parameters can all lead to poor quality phylogenies that may result in incorrect inference of evolutionary history. Imperfect information can also lead to downstream misprediction, for example when there are two equally likely evolutionary paths based on the underlying data, but of course only one is true. This is further complicated by biological processes, such as incomplete lineage sorting (ILS), that resist the bifurcating tree model employed to try to make sense of evolution. I continue the discussion of the pitfalls of the phylogenetic method below, but in short, the aim is to ensure, as carefully as possible, that gene trees faithfully represent genic evolutionary history and that the reference tree truly represents vertical descent.

Phylogeny-based methods may also struggle with detecting gene exchange between closely related taxa, as the phylogenetic signal in such scenarios will closely resemble vertical descent, increasing uncertainty. Secondly, orthology is typically determined on a gene-by-gene basis and so the history of associated regulatory regions often remains unknown, though some pioneering work has made progress in this direction (Oren et al. 2014). Lastly, scaling phylogeny-based methods to larger datasets can be prohibitively computationally expensive, often resulting in a need to trade potential accuracy for speed. However, despite these obstacles, phylogenetic methods offer several advantages, notably the ability to infer older transfer events despite sequence amelioration.
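The notion of topological incongruence can be illustrated with a toy comparison of two rooted trees. The sketch below is not the reconciliation method applied in this thesis: it simply lists the clades (groups of tips descending from a single node) that are present in a gene tree but absent from the reference tree. Trees are written as nested tuples, and the taxon labels are hypothetical.

```python
def clades(tree):
    """Return the set of clades (as frozensets of tip labels) in a rooted tree
    written as nested tuples, e.g. (("A", "B"), "C")."""
    found = set()

    def walk(node):
        if isinstance(node, str):              # a tip
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)                        # an internal node defines a clade
        return tips

    walk(tree)
    return found


def incongruent_clades(gene_tree, species_tree):
    """Clades supported by the gene tree that the species tree does not contain."""
    return clades(gene_tree) - clades(species_tree)


# Hypothetical taxa: two GPA genomes, a Bacillus, and a distant outgroup.
species_tree = ((("GeoA", "GeoB"), "Bsub"), "Ecoli")
gene_tree = ((("GeoA", "Ecoli"), "GeoB"), "Bsub")   # GeoA copy groups with the outgroup

for clade in sorted(incongruent_clades(gene_tree, species_tree), key=len):
    print(sorted(clade))
```

A conflicting clade of this kind is only a starting point. As noted above, it may reflect HGT, a series of duplications and losses, or simply an unreliable gene tree, which is why methods that weigh these alternatives explicitly are needed.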

Compositional or phylogenetic: suitability for this work

For this work, I wanted to establish a robust set of HGT predictions across GPA evolution - including ancient HGT events predicted at, or near, the base of the group which, based on the most conservative (most recent) estimate (Marin et al. 2017), diverged from the rest of the Bacillaceae over 300 Ma. Given the age of the clade, many HT genes contributing to early GPA evolution would likely be beyond the detection threshold of compositional methods. Phylogenetic methods are capable of identifying such ancient events by establishing orthology based on the amino acid sequences of the gene product rather than nucleotide sequences. Unlike the underlying nucleotide sequence, which tolerates more substitutions owing to codon redundancy, peptide sequence is more constrained by protein function and can remain identifiable over longer evolutionary timescales. Furthermore, once armed with a gene and species tree, some phylogenetic methods of HGT inference allow the entire evolutionary history of the gene to be unravelled - including predicted speciations, duplications, losses, and transfers of the gene in all species represented in the phylogeny. Importantly, such approaches can also predict both the donor and recipient branches that participated in the transfer and so permit HT genes to be stratified not only in (genomic) space but also in evolutionary time. Finally, focusing on gene flow into a single clade permits some shortcuts in what can otherwise be a computationally intensive workflow (see below).

Thus, I considered a phylogenetic approach to be more suitable for detecting HGT in this work. However, I did incorporate compositional methods in one stage of my analysis. After HGT inference, I wanted to compare the positions of predicted GPA HT genes to established havens of HGT - genomic islands (GIs; see Chapter 4, Section 4.3.3). As mentioned above, compositional analysis is well suited to the task of predicting contiguous pockets of high gene turnover across the genome. However, in lieu of performing a de novo search for GIs, I made use of a pre-computed resource that combines GI region predictions calculated by four separate compositional approaches (IslandViewer4; Bertelli et al. 2017).


2.1.2 Specific steps in phylogenetic analysis

Following the brief overview of pitfalls and common sources of error in phylogenetic methods outlined above, in this section I discuss the key stages of a typical phylogeny-based HGT prediction pipeline. For each stage, I highlight potential sources of error and discuss alternative methods in the context of accuracy and speed (where these are available). In cases where I have augmented a particular step to alleviate a source of error, I provide a pointer to the relevant section of this chapter’s results.

At the level of genes and gene families

Gene trees represent the evolutionary history of homologous sequences, whether the homology is shared between different genomes (orthologues) or within the same genome (paralogues). As such, the first step in phylogeny-based pipelines is the identification of homology. This is typically achieved by calculating the similarity between some sequences of interest (e.g. all peptide sequences in the proteome of one organism) and a set of subject sequences (peptides from other organisms). The BLAST suite of local alignment tools (Altschul et al. 1990) is frequently used for this. For smaller datasets, it is feasible, even with limited computational resources, to perform an “all vs all” BLAST to identify homologues for each sequence in each genome among the sequences in every other genome. However, as the number of genomes (n) grows, the number of calculated comparisons grows quadratically (n²). As described below in Section 2.3.3, focusing exclusively on transfers into GPA allows me to minimise the computational cost of the similarity search whilst still maintaining a broad phylogenetic search space.

Once a network of homology relationships - “BLAST hits” - is calculated, it can be simplified by selecting only those connections that are bi-directional based on their highest homology scores. That is, when BLASTing sequences between two genomes A and B, a given sequence X in genome A should have the highest similarity to sequence Y in genome B and vice versa. These connections are termed reciprocal best BLAST hits (RBBHs). The presence of multiple very similar sequences in one genome can prevent RBBHs from being established. The resulting set of RBBHs can be fed into the Markov Cluster algorithm (MCL) (Enright, Dongen, and Ouzounis 2002). This graph-based algorithm clusters the genes based on their RBBH relationships, producing putative orthologous “gene families” (here and further I refer to these clusters as gene families regardless of the underlying sequence type).
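The RBBH definition above can be made concrete with a short sketch. It assumes each direction of a proteome-versus-proteome search has already been reduced to (query, subject, bitscore) tuples; that input format, the use of bitscore as the ranking criterion, and the function names are assumptions for illustration and do not reproduce the scripts used in this work.

```python
def best_hits(hits):
    """Map each query to its single best-scoring subject.
    `hits` is an iterable of (query, subject, bitscore) tuples; on ties the
    first hit encountered is kept (an arbitrary choice in this sketch)."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (x, y) pairs where y is the best hit of x in genome B
    and x is the best hit of y in genome A."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((x, y) for x, y in best_ab.items() if best_ba.get(y) == x)
```

If two near-identical sequences in genome A both have the same best hit Y in genome B, at most one of them can be the best hit of Y in return, which is how closely related paralogues can fail to form an RBBH.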

Importance of context

Sequence similarity can be detected at different thresholds (e.g. BLAST E-values) - and modulating this cut-off score will influence the size of the resulting gene families. Highly divergent or novel proteins may find only a handful of homologues even at the most lenient thresholds. At the same lenient threshold, gene families of highly conserved proteins (such as ribosomal proteins) may detect homologues in every genome, levying elevated computational costs in later steps. More significantly, including spurious or partial homologues found at lenient thresholds can negatively impact sequence alignment and result in poor trees. There cannot, of course, be a definitive similarity threshold for the same reason that, in the absence of perfect historical information, there is no definitive cut-off for what makes a homologue.

In my work, one important aspect of this issue is that downstream HGT inference into a single clade benefits from phylogenetic “context”. I define context as the leaves of the phylogeny surrounding a group of interest in a particular gene tree. For example, referring to Figure 2.1, for some set of GPA sequences (red tips), the nearest non-GPA tips make up the immediate context (Context 1), while the furthest tips from GPA give the loosest context (Context 3). The amount of context available in a gene tree may be insufficient to fully resolve evolutionary history. Again referring to Figure 2.1, consider a gene tree which is restricted to just Context 1 - in this case it is not possible to tell whether the gene originated within GPA and transferred to the non-GPA genomes, or vice versa. Expanding the gene tree to Context 2 clarifies that this gene was likely transferred into GPA from the non-GPA group. However, increasing context further (Context 3) does not provide any extra information in terms of the history of this gene with respect to GPA. To summarise, too few homologues might make it impossible to predict the directionality of a transfer. However, beyond a certain point, inclusion of more distantly related sequences is unlikely to provide any additional information, but will increase computational cost.

Figure 2.1: Phylogenetic context in HGT inference. Dashed boxes show three different breadths of phylogenetic context, from 1 (the narrowest context) to 3 (the broadest). The sequences included in Context 1 are insufficient to determine whether the gene originated in the GPA clade (red) or was transferred from the non-GPA group (blue). At Context 2, there is strong evidence for GPA to have received this gene from the non-GPA group by HGT. Context 3 provides no added benefit over Context 2 in determining gene history with respect to GPA.

This concept of context is only applicable when the focus is HGT inference into a specific group; including more sequences will always provide more information if the entire evolutionary history of a particular gene is of interest. So, in this work I again leverage my desire to detect transfers into a single clade to modulate parameters in establishing homology. This allows me to retain maximum context for those gene families requiring it while reducing the size of larger gene families (having more than sufficient context) for computational tractability (see Section 2.3.3). Phylogenetic context is also an important consideration when choosing the number (and breadth) of taxa included in the initial homology search space. In practice, more genomes should reduce the number of orphan genes (those not finding homologues); however, this comes at a cost of increased complexity in constructing the reference tree.

Alignment and gene tree construction

Gene tree quality in large part depends on the underlying multiple sequence alignment (MSA) (Ogden and Rosenberg 2006). However, MSA quality can be difficult to assess. Thus, most efforts at improving alignments are preventative measures - to a priori remove highly heterogeneous sequences from a gene family that might harm alignment. This can be achieved by setting sequence identity or length thresholds, although the benefit to alignment quality and downstream tree reconstruction is often assumed rather than empirically demonstrated, and excluding heterogeneous alignment columns might actually be detrimental (Tan et al. 2015). In this work, rather than levying arbitrary restrictions, I once again allow context to determine inclusion. By this logic, divergent sequences in otherwise highly conserved gene families will be omitted as they are in excess of the context requirement. Conversely, in sparsely populated families even divergent homologues will be included in the alignment in an attempt to gather sufficient context - even if at the potential cost of alignment accuracy.

Beyond the alignment and data sparsity, gene tree accuracy then largely depends on the algorithms used in their reconstruction and the parameters chosen. Amongst non-Bayesian approaches, heuristic maximum likelihood (ML) methods, such as those implemented in RAxML (Stamatakis 2014), are considered the ‘gold standard’. A further advantage of RAxML lies in its ability to select the most appropriate model of substitution for each alignment, simplifying the often-difficult task of choosing an appropriate model to be used across the entire dataset. However, such ML methods can become computationally costly for larger alignments and very hard to implement for the largest gene families (4000+ sequences), even when tweaking internal parameters that increase speed at a small cost to accuracy. FastTree (Price, Dehal, and Arkin 2010) presents a much faster mixed-method approach. Although Price and colleagues acknowledge that the high speed comes at some cost to accuracy, they highlight that FastTree consistently found the vast majority (96%-98%) of highly supported splits found by RAxML. Thus, if HGT detection were only considered in the context of high confidence branchings, FastTree might provide a reasonable and high-speed alternative. I evaluate these two alternatives side by side by their impact on downstream HGT inference in Section 2.3.9.
Even if alignment and tree reconstruction are robust, error and uncertainty in the phylogeny can come from a number of sources. These can be driven by underlying biological processes such as ILS, where long term maintenance of an ancestral polymorphism in a population may result in ambiguous phylogenetic signal in rapidly speciating descendants (see Galtier and Daubin 2008), and homologous recombination in ancestral populations (Retchless and Lawrence 2010). Alternatively, lack of sufficient signal in the alignment can lead to uncertainty at different depths of the phylogeny. For example, highly conserved genes, such as those encoding ribosomal proteins, may share complete sequence identity between strains of the same species, or even species within a genus. Since a distinct relationship cannot be resolved, and most downstream analyses require strictly bifurcating trees (each non-terminal branch has two descendants), the branching order is determined randomly. Both this, and the confounding biological processes above, may lead to incongruence with the reference tree, resulting in false HGT inference.

In cases where poor phylogenetic signal comes from the alignment, uncertainty in the branching can be calculated using bootstrapping (Efron, Halloran, and Holmes 1996; Felsenstein 1985). Below, I describe how bootstrap values can be used at the HGT inference step to avoid spurious predictions.

The reference tree

A highly accurate gene tree is only as useful for HGT inference as the null model of vertical evolution to which it is compared - the reference tree. Vertical descent has long been inferred by reconstructing the evolutionary history (gene tree) of a single, ubiquitous gene, such as 16S rRNA or DNA gyrase, but this approach has now largely been superseded by concatenation-based approaches. In the single-gene method, accuracy depends on two assumptions: first, that the gene is present in a single copy in all organisms in the dataset and, second, that it is completely refractory to HGT. The first is trivially testable but, while the latter is often considered a reasonable assumption, there is some evidence of HGT amongst highly conserved operational genes (Tian et al. 2015).

Concatenation extends the single-gene approach by individually aligning multiple highly conserved genes and concatenating the alignments into a supermatrix that is then used to reconstruct the phylogeny. Providing more phylogenetically informative sites can improve resolution at both shallower and deeper nodes; however, inclusion of more gene families may introduce greater error from hidden HGT and ILS. Partially in an attempt to account for ILS, methods modelling the multispecies coalescent have gained in popularity despite findings that suggest that ILS-derived error is comparable in both concatenation and “ILS-sensitive” methods (Tonini et al. 2015). Modern coalescent-based methods can provide accurate reconstruction even in the presence of HGT and missing data (Davidson et al. 2015), providing a tractable alternative to searching for 1-to-1 orthologues that cannot guarantee an exclusively vertical history. In addition, ML tree estimation from a concatenated alignment spanning thousands of sites, and thousands of taxa, is computationally costly while coalescent-based methods can produce a reference tree in a fraction of the time. In Section 2.3.5, I combine a coalescent-based method with a concatenation-based approach to estimate the reference topology of over 5000 taxa without sacrificing resolution at the species level amongst the organisms most closely related to GPA.
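The concatenation step described above can be sketched as follows. Each per-gene alignment is assumed to be a dictionary mapping taxon name to an aligned sequence of equal length; padding absent taxa with gap characters is an assumption of this sketch (a strict single-copy, universally present gene set would never need it), and the function name is hypothetical.

```python
def concatenate_alignments(alignments, gap="-"):
    """Join per-gene alignments into one supermatrix.
    `alignments` is a list of dicts {taxon: aligned_sequence}. Taxa missing
    from a gene are padded with gaps so all rows end up the same length."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}
```

The resulting rows can be written out in any format accepted by the tree-building program; the order of genes in the supermatrix follows the order of the input list.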

Comparing HGT inference

Assuming accurately assembled gene and reference trees, detecting and defining the incongruence between the topologies can be achieved in several ways. At the most fundamental level, statistical tests of topology, such as Kishino-Hasegawa (KH, Hasegawa and Kishino 1989; Kishino and Hasegawa 1989), Shimodaira-Hasegawa (SH, Shimodaira and Hasegawa 1999), and the Approximately Unbiased (AU, Shimodaira 2002) tests, explicitly consider the likelihood of the gene tree topology being drawn from the same distribution as the reference tree. Assuming well modelled trees, rejection of the null hypothesis implicates non-vertical evolutionary events in the gene history, but prediction of what and where these might be is reserved for other methods.


Genome spectral approaches rely on decomposing the reference tree into either bipartitions (by removing one internal edge; Lockhart, Penny, and Meyer 1995) or quartets (e.g. as used in Bansal, Gogarten, and Shamir 2010) and investigating whether these subsets are present in the gene tree. Bipartitions or quartets that conflict between the reference and gene trees can indicate the presence of HGT. Bootstrap support of the putatively transferred branches can be used to modulate prediction sensitivity.

Another method, subtree pruning and regrafting (SPR), involves pruning a subtree from the reference tree, by cutting one internal edge, and regrafting it at another location. If the gene and reference trees were initially incongruent, one or more such operations can result in congruence and indicate possible HGT events - as well as the putative donor and recipient branches. Thus, a sequence of SPR events can map the non-vertical evolutionary history of the gene, and HGT predictions can also be tempered by bootstrap support.

Model-based reconciliation (Goodman et al. 1979; Page 1994) attempts to explain the differences between the gene and reference trees within an explicit framework of evolutionary events - typically duplications (D), transfers (T), and losses (L). Parameters describing the evolutionary costs of D, T, and L events are typically provided to the reconciliation algorithm a priori and may determine what combination of DTL events will reconcile a particular topological discrepancy. A higher T:L cost ratio will result in a predicted evolutionary history dominated by multiple loss events as opposed to HGTs. Reconciliation suffers from this necessity to provide evolutionary costs, which introduces an uncomfortably arbitrary value into the method. Although some algorithms require no a priori costs, and estimate the parameters from the data using optimisation criteria (Merkle, Middendorf, and Wieseke 2010), these are no more likely to be intrinsically correct in a biological context.

So what costs should be used? It is often assumed, given the purifying nature of microbial selection, that the relative cost of a transfer should be higher than that of a loss - an assumption supported by David and Alm (2011). In that work, the authors determined relative D and T costs (2 and 3 respectively, where L = 1) that minimised the flux in size of inferred ancestral genomes. Nevertheless, despite these guidelines, true costs must vary between types of HT gene, host genomes, ecological factors, and timing. They are essentially unknowable. However, work on simulated and real datasets has demonstrated that reconciliation algorithms can be robust to cost misspecification, and accuracy can be further improved by applying concurrent gene tree rearrangement on poorly supported branches to avoid over-prediction (Nguyen et al. 2013).
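The effect of event costs on which history is preferred can be shown with simple arithmetic. The default costs below (D = 2, T = 3, L = 1) are those attributed to David and Alm (2011) above; the two competing scenarios and the function names are hypothetical, and the sketch only scores scenarios that are handed to it, whereas a real reconciliation algorithm searches over all of them.

```python
def scenario_cost(duplications=0, transfers=0, losses=0,
                  d_cost=2, t_cost=3, l_cost=1):
    """Total parsimony cost of one candidate evolutionary scenario."""
    return duplications * d_cost + transfers * t_cost + losses * l_cost


def preferred_scenario(scenarios, **costs):
    """Name of the cheapest scenario. `scenarios` maps a name to a dict of
    event counts, e.g. {"transfer": {"transfers": 1}}."""
    return min(scenarios,
               key=lambda name: scenario_cost(**scenarios[name], **costs))


# Two hypothetical explanations for the same incongruence: one transfer,
# or vertical inheritance followed by four independent losses.
candidates = {
    "single_transfer": {"transfers": 1},
    "vertical_with_losses": {"losses": 4},
}
```

With the default costs, `preferred_scenario(candidates)` selects the transfer (cost 3 against 4); raising the transfer cost, for example `preferred_scenario(candidates, t_cost=5)`, flips the preference to the loss-based history. This is the lever that makes a transfer prediction more or less conservative.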

Final considerations

The genes to gene family to gene tree pipeline is well established in the field and has, with some variation, been successfully applied for well over a decade in many works. Beyond my own ad hoc augmentations, which I will describe in detail below, the major decision in methodology lies at the level of gene tree / species tree comparison. Given my emphasis on conservative, rather than comprehensive, detection of HGT events, the cost-specification of reconciliation methods was a distinct advantage. By modulating costs, and so the likelihood of predicting a transfer, I could choose to detect only the most robustly supported transfers. Coupled with local rearrangement of poorly supported gene tree branches, reconciliation presented itself as the clear choice for evaluating gene tree / species tree conflict.

2.1.3 Chapter aims

My aims for this chapter were to apply existing phylogeny-based methods, augmented where possible for speed and accuracy, to detect HGT into GPA. As part of HGT inference, I also wanted to identify gene families that followed a strictly vertical evolution into GPA as a point of comparison in both evaluating the quality of HGT detection and spatial analysis. Finally, I wanted to temporally separate my HGT predictions into more recent and older HGT events. This would allow me to determine which selective forces operate on gene gain in the short term and which act over longer timescales.


2.2 Methods

Unless otherwise specified, all computational analyses were performed using scripts written by me in either R (v.3.1-3.4) or TCL (v.8.6). For TCL scripting, only the base language was used. For R scripting, I made extensive use of the tidyverse suite of packages (Wickham 2017). Unless otherwise specified, plots were produced using the ggplot2 (v.2.2.1.900) package in R (Wickham 2016). Many colour schemes for R plots in this work were chosen from the wesanderson R package (Ram and Wickham 2018). Unless otherwise specified, significance values for difference in means were calculated by applying the Wilcoxon test using the stat_compare_means function from the ggpubr R package (Kassambara 2017). Where necessary, plots were recoloured and relabelled in Adobe Illustrator CS6. The above applies to all subsequent results chapters, as well as preceding figures.

2.2.1 Genome download and processing

Genomes (N = 5073, see Section 2.3.1 for selection criteria) were downloaded in the gbff format from the National Center for Biotechnology Information (NCBI) ftp server using a custom R script utilising the aria2 download utility (https://aria2.github.io/). The amino acid sequences associated with the coding sequence (CDS) features were extracted from the gbff files into FASTA-formatted proteomes with the following qualifiers stored as the FASTA header: non-redundant NCBI protein identifier (protID), locus tag, gene name, product function, gene location. The gbff-to-FASTA conversion was done using the genbank_to_fasta.py script (http://rocaplab.ocean.washington.edu/tools/genbank_to_fasta/). Any sequences that lacked valid protIDs were excluded at this stage (636623 of 18.25 million (M); 3.5%). Where multiple identical protIDs were present in the proteome, the first was retained and the remainder transiently masked for later re-addition.
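The filtering and masking rule in the last two sentences can be sketched as below. Records are assumed to be (protID, sequence) pairs in file order, and an empty or missing protID stands in for "lacking a valid protID"; the real validity check and the function name are not specified in the text and are assumptions here.

```python
def mask_duplicate_protids(records):
    """Split one proteome into retained and masked entries.
    `records` is a list of (prot_id, sequence) in file order. Entries with no
    protID are dropped; the first occurrence of each protID is retained and
    later occurrences are set aside for re-addition after clustering."""
    seen = set()
    retained, masked = [], []
    for prot_id, sequence in records:
        if not prot_id:                 # no valid identifier: excluded
            continue
        if prot_id in seen:
            masked.append((prot_id, sequence))
        else:
            seen.add(prot_id)
            retained.append((prot_id, sequence))
    return retained, masked
```

Keeping the masked list, and not simply discarding duplicates, is what allows each copy to be restored to the gene family of its retained counterpart later on.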


2.2.2 Sequence databasing

All 17.6M protein sequences and associated data were assembled into an SQLite relational database (v.3.19.3) to improve lookup speed in further analysis. The database was created with the following parameters to improve speed: main.page_size = 4096, main.cache_size = 10000, main.locking_mode = EXCLUSIVE, main.synchronous = NORMAL, main.journal_mode = WAL. Each protein with a valid protID in every genome had an entry with the following fields: protID, locus, gene start, gene end, gene strand, (T/F), protein product, protein sequence, genome accession and assembly, genome binomial, strain, taxid, genome length, GPA (T/F), nucleotide sequence, functional category (COG), gene family ID. Data was retrieved from the SQLite database in TCL by using the sqlite3 package (v.3.13.0, https://www.sqlite.org/tclsqlite.html) and in R using the RSQLite package (Müller et al. 2017).
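For illustration, the listed parameters correspond to standard SQLite PRAGMA statements, which could be applied as in the sketch below. This uses Python's built-in sqlite3 module and not the TCL and R interfaces used in this work, and the table and column names are hypothetical, covering only a subset of the fields listed above.

```python
import sqlite3


def open_tuned_database(path):
    """Open (or create) an SQLite database and apply the speed-oriented
    settings listed in the text. page_size only takes effect if it is set
    before the database file is first written."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA main.page_size = 4096")
    conn.execute("PRAGMA main.cache_size = 10000")
    conn.execute("PRAGMA main.locking_mode = EXCLUSIVE")
    conn.execute("PRAGMA main.synchronous = NORMAL")
    conn.execute("PRAGMA main.journal_mode = WAL")
    return conn


def create_protein_table(conn):
    """Create an illustrative table; the schema is an assumption."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS proteins (
               prot_id TEXT, locus TEXT, gene_start INTEGER, gene_end INTEGER,
               strand TEXT, product TEXT, sequence TEXT, taxid INTEGER,
               is_gpa INTEGER, gene_family TEXT)"""
    )
    # An index on the identifier is what makes repeated lookups fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_prot_id ON proteins (prot_id)")
    conn.commit()
```

Exclusive locking and write-ahead logging trade safe concurrent access for speed, which suits a database that is built once and then read by a single process at a time.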

2.2.3 High performance computing

The majority of the computationally intensive calculation was undertaken using the Imperial College London high performance computing (HPC) resource, now the Imperial College Research Computing Service (2018). Calculations were performed on a shared memory system, running the PBS Professional workload manager. Input scripts were written in the TCL and Bash programming languages.

2.2.4 Program parameters

Unless otherwise stated, all computational steps were carried out on a shared memory cluster administered by the Imperial College Research Computing Service.

All BLAST calculations were performed with blastp from the BLAST+ 2.2.28 suite with the following non-default parameters: -outfmt 6 -max_target_seqs 1 -max_hsps_per_subject 1 -seg "yes" -soft_masking true -num_threads 1 (see below for values of the -evalue parameter).

Gene family clustering was performed with MCL (v. 14.137, Enright, Dongen, and Ouzounis 2002) and the following non-default parameters: --abc -I 2.0 -te 20 -scheme 7 --abc-neg-log10 -abc-tf "ceil(200)". Prior to clustering, I changed all RBBH E-values that were equal to 0 (which represents the highest homology calculable by BLAST) to 1E-200. This was done because the next highest homology score is 9.9E-199, and conversion of 0 to 1E-200 recovers consistency in the scoring.

MSA for the GPA-centric dataset (Section 2.3.4) was performed with Clustal Omega (v.1.2, Sievers et al. 2011) with default parameters. MSA for the concatenation-based species tree reconstruction (Section 2.3.5) was performed using MUSCLE (v.3.8.31, Edgar 2004) with default parameters.

Gene tree reconstruction was performed with FastTree (v.2.1.8, Price, Dehal, and Arkin 2010) with the following non-default parameters: -gamma -boot 100. For gene tree benchmarking (Section 2.3.9) the FastTree gene trees were reconstructed as above, while the RAxML trees were built with RAxML (v.8.1.17, Stamatakis 2014) and the following parameters: -f a -D -m PROTCATAUTO -p [random int] -x [random int] -N 100 -T 4.

I used ASTRAL-II (v.5.5.9, Mirarab and Warnow 2015) to assemble the species tree with the following parameters: -Xms4g -Xmx60g. Concatenation-based reference trees for the Bacillaceae were built with RAxML v.8.2.10 and the following parameters: -f a -m PROTCATAUTO -p [random integer] -x [random integer] -N 100 -T 8.

Reconciliations were performed with Mowgli (v.2, SVN revision number 486.500.398; Doyon et al. 2010; Nguyen et al. 2012, 2013) with the following parameters: -v -n 1 -T 80 -t [varied] -d 2 -l 1.
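The E-value handling described for the clustering step can be illustrated with a small sketch. It replaces an E-value of 0 with 1E-200 and converts E-values to edge weights by negative log10 with a ceiling of 200, which is my understanding of what the listed --abc-neg-log10 and -abc-tf "ceil(200)" options do inside MCL; the function itself is hypothetical and only mirrors that transformation outside MCL.

```python
import math


def evalue_to_weight(evalue, zero_replacement=1e-200, cap=200.0):
    """Convert a BLAST E-value to a graph edge weight.
    An E-value of 0 (reported when the true value is too small to represent)
    is first replaced so that it maps to the largest weight, not an error."""
    if evalue == 0:
        evalue = zero_replacement
    return min(-math.log10(evalue), cap)
```

Under this transformation the strongest non-zero score mentioned above, 9.9E-199, maps to a weight just above 198, while a reported 0 maps to the maximum of 200, so identical or near-identical sequences keep the strongest edges in the graph.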


2.3 Results

2.3.1 Two genomic datasets

Over the course of developing the HGT inference pipeline outlined in the rest of this section, I made use of two separate datasets. The initial dataset was based on a set of genomes downloaded from the NCBI Refseq database in October 2014, and included 3327 bacterial, 403 eukaryotic, and 223 archaeal genomes. The bacterial dataset included 19 Geobacillus and Parageobacillus genomes (at time of download, the recognised genus name for Parageobacillus was Geobacillus). Where multiple assemblies were available, the most complete and most recent were downloaded; however, there was no minimal level of assembly required. This resulted in a large number of bacterial genomes that were only available at the contig level, including 7 of the 19 Geobacillus genomes, and so was not optimal for downstream spatial analysis, which later became the main focus of my work.

To provide a more robust dataset for spatial analysis and make use of new fully assembled genomes, I sourced a second set of bacterial genomes from NCBI Refseq in November 2017. I first filtered all available genomes for only those assembled at the “Complete Genome” level (8296 of 100270). These 8296 genomes contained 5073 unique taxonomic identifiers (taxids), each representing a single taxon. For this dataset, I wanted to ensure that each taxon was represented by a single genome. I found that 4577 taxids were represented by a single assembly, with the remaining 496 having two or more genomes assembled to the “Complete Genome” level. In 360 of 496 cases the latest assembly was chosen, based on the assumption that newer assemblies are of higher quality - having presumably been produced by more modern pipelines. In the remaining 136 of 496 cases, in which multiple assemblies shared a single latest assembly date and so there was no rationale for picking one or another, a genomic assembly was selected at random.
Thus, the final dataset comprised 5073 genomes each representing a single taxid and included 25 GPA genomes; the assembly did not need to be selected randomly for any GPA taxid. This dataset was as taxonomically broad (within the bacteria) as the requirement to have an assembled genome would allow, and as such would provide the best opportunity to identify orthologous relationships with sufficient context to infer HGT.

It is the 5073-genome dataset that serves as the basis for the results presented in the remainder of this work. Nevertheless, the first dataset proved invaluable in benchmarking the HGT detection pipeline and selecting (or developing) appropriate methods to enhance speed without compromising accuracy, as will be discussed below. Unless explicitly stated otherwise, all quantitative and qualitative results are based on the more recent dataset.
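The one-genome-per-taxid rule used to build the second dataset can be sketched as follows. Assemblies are assumed to be (taxid, accession, date) tuples with ISO-formatted dates (YYYY-MM-DD) so that string comparison orders them correctly; the input format and the function name are assumptions for illustration.

```python
import random
from collections import defaultdict


def select_one_assembly_per_taxid(assemblies, rng=None):
    """Pick a single assembly for each taxid.
    A taxid with one assembly keeps it; otherwise the latest assembly is
    chosen, and ties on the latest date are broken at random."""
    rng = rng or random.Random()
    by_taxid = defaultdict(list)
    for taxid, accession, date in assemblies:
        by_taxid[taxid].append((date, accession))

    chosen = {}
    for taxid, candidates in by_taxid.items():
        latest_date = max(date for date, _ in candidates)
        tied = sorted(acc for date, acc in candidates if date == latest_date)
        chosen[taxid] = tied[0] if len(tied) == 1 else rng.choice(tied)
    return chosen
```

Passing a seeded `random.Random` instance makes the random tie-breaks reproducible between runs.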

2.3.2 Genes to proteins, proteins to homologues

Protein sequences were extracted and pre-filtered from each of the 5073 genomes (for details see Section 2.2.1). Importantly, duplicate proteins within each genome (N = 123118 across all genomes) were masked at this stage to simplify homology detection and re-added after clustering. Data for all proteins (N = 17.6M) was collated and held in an SQLite database for improved lookup speed in further analysis (see Section 2.2.2). The total set of chromosomal and plasmid GPA proteins with valid protIDs was 83468, of which 1221 duplicates were masked.

Identifying homology

As my focus was to detect gene flow into the GPA clade, it was unnecessary to conduct a full all-versus-all homology search and instead I adopted a GPA-centric approach. For this, each GPA proteome was reciprocally BLASTed against every other proteome in the dataset (at a permissive E-value cut-off of 1E-10). Effectively, this was equivalent to an all-versus-all BLAST for the 25 GPA proteomes as well as a bidirectional BLAST for each GPA proteome against every other non-GPA proteome. Critically, this excluded BLASTs between any two non-GPA proteomes which would provide homology links for non-GPA proteins - not needed for this analysis. This required a total of 253000 BLAST calculations compared to the 25735329 that would be needed for an all-versus-all BLAST (<1%).

Paralogues were identified for each GPA proteome by performing a BLAST of the proteome against itself. Any non-self BLAST hits were considered possible paralogues. Sequences were considered true paralogues if the BLAST bitscore of a putative paralogue pair was greater than or equal to the maximum bitscore between either protein sequence and any orthologue. Of the 167 true paralogue pairs within GPA species, 108 (65%) were identified as transposases - highly promiscuous mobile elements. This enrichment is unsurprising given the ability of transposases to duplicate within host genomes (see Darmon and Leach 2014).
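The saving from the GPA-centric search can be checked with a short calculation. The sketch counts one BLAST run per ordered (query proteome, subject proteome) pair, including each proteome against itself; that counting convention is an assumption, so the focal-centric total it returns is close to, but not exactly, the figure quoted above.

```python
def blast_run_counts(n_genomes, n_focal):
    """Number of proteome-vs-proteome BLAST runs for an all-versus-all search
    and for a search restricted to pairs involving a focal proteome."""
    all_vs_all = n_genomes ** 2
    n_other = n_genomes - n_focal
    # Focal proteomes as queries against everything, plus every non-focal
    # proteome as query against each focal proteome (the reciprocal direction).
    focal_centric = n_focal * n_genomes + n_other * n_focal
    return all_vs_all, focal_centric
```

For 5073 genomes with 25 focal proteomes this gives 25735329 runs for the all-versus-all search, matching the text, and 253025 for the focal-centric search, a little under 1% of the total.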

2.3.3 Homologues to gene families

Identifying homology

RBBHs for each pair of proteomes were then identified. All orthologue and paralogue RBBH connections were collated and subjected to clustering with MCL, using the E-values of the connections as graph edge weights. With all connections considered, MCL produced 7733 clusters corresponding to putative gene families - comprising 6535215 proteins. Next, previously masked duplicate protIDs (N = 38831) were re-added into clusters containing their unmasked duplicates (also see Table 2.1). At this stage, a total of 82461 (98.8%) GPA proteins were present in the putative gene families.

Paralogues as a measure of cluster quality

I then compared the distribution of previously identified in-paralogues as a qualitative measure of clustering, with the expectation that a pair of paralogues should belong to the same gene family cluster. Of 167 true paralogue pairs identified in GPA at 1E-10, 28 (16.8%) were split between gene family clusters; 20 of these (71%) were identified as transposases. Although I did not explore this explicitly, I suspect that this paralogue splitting is due to the promiscuity of this gene type. High incidence of transfer and duplication is likely to provide a greater-than-average number of strongly weighted connections between either paralogue and many orthologous sequences. The nature of the RBBH approach means that only one paralogue within a genome will be able to hit a particular orthologue, and the expectation is that the very strong edge weight connecting the two paralogues will link the two subclusters (Figure 2.2, left). In the presence of a large number of strongly weighted edges connected to orthologues, the paralogue link may be broken in favour of making two smaller clusters (Figure 2.2, right). As a result, I would not consider the observed rate of paralogue misclustering to be representative of the clustering quality in the rest of the dataset. Nevertheless, given the expectation that these clusters should not be split, gene families containing misclustered paralogues were conservatively removed. Note that some gene families contained more than one misclustered paralogue, thus the 56 paralogues (28 pairs, above) were misclustered to 52 gene families (also see Table 2.1). Finally, any gene family that contained fewer than four protIDs was removed as these cannot be subjected to phylogenetic reconstruction (Table 2.1). This left 6726 gene families.
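The two removal rules described above (families containing split paralogue pairs, and families with fewer than four members) can be sketched as below. Gene families are assumed to be a dictionary mapping a family identifier to a set of protIDs; this structure and the function name are hypothetical.

```python
def filter_gene_families(families, paralogue_pairs, min_size=4):
    """Drop families that contain a misclustered paralogue or are too small.
    `families` maps family_id -> set of protIDs; `paralogue_pairs` is an
    iterable of (protID, protID) true paralogue pairs."""
    membership = {prot: fid for fid, members in families.items() for prot in members}

    split_families = set()
    for p1, p2 in paralogue_pairs:
        f1, f2 = membership.get(p1), membership.get(p2)
        # A pair is misclustered when both members were clustered, but apart.
        if f1 is not None and f2 is not None and f1 != f2:
            split_families.update((f1, f2))

    return {fid: members for fid, members in families.items()
            if fid not in split_families and len(members) >= min_size}
```

Because both families holding a split pair are removed, a single misclustered pair can eliminate two families, while several pairs may fall in the same families; this is why the number of removed families need not equal the number of misclustered paralogues.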

Figure 2.2: Clustering of paralogues. Illustration of how true paralogues may be misclustered in the presence of a high number of closely related orthologues. Left panel: a single cluster (dashed red oval) is predicted when the homology between true paralogues (P1, P2) strongly outweighs (thick black line) each paralogue’s homology to orthologous sequences (O1..On). Right panel: two clusters are predicted when many closely related orthologues are connected to each paralogue, diminishing the importance of the paralogue-paralogue connection.

Reducing gene family size

The trailing distribution of proteins included in the gene family clusters (Figure 2.3A) made it clear that a large proportion of total proteins lay in the largest gene families. I describe the importance of context in HGT inference above (Section 2.1.2): specifically, how beyond a certain size the number of genes in a gene family should no longer improve HGT detection into GPA. In light of this, I developed an ad hoc approach to reduce gene family size while maintaining context for those gene families requiring it. To reduce gene family size, I re-identified RBBH relationships at higher stringencies (1E-50, 1E-100, 1E-150) by filtering the existing BLAST results. These RBBHs were then clustered with MCL as above. At the most permissive threshold (1E-10), downstream gene trees would have the greatest phylogenetic context around the GPA clade. With increasing stringency of homology (1E-50, etc.) more distant orthologues would no longer be included in the gene family, and the context would become narrower (refer to Figure 2.1). The aim was to select a threshold, per gene family, that maintained sufficient context around the GPA clade to clearly infer gene history while excluding extraneous sequences and in doing so reduce gene family size. Comparative results of clustering at different thresholds are presented in Table 2.1. However, I could not simply select all gene families clustered at a higher stringency because, critically, the number of GPA proteins included in the final gene families decreases with increasing stringency. It was the desire to include all the GPA protein diversity seen at the most lenient threshold that informed my cluster selection approach.
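Re-identifying RBBHs at a stricter threshold by filtering existing BLAST results can be sketched as below. This is an illustrative sketch under stated assumptions, not the code used in this work: the hit tuple layout and the `genome|protein` identifier format are hypothetical.

```python
def genome_of(protein_id):
    # Assumes hypothetical IDs of the form "genome|protein".
    return protein_id.split("|")[0]


def rbbh(hits, max_evalue):
    """Reciprocal best hits among BLAST hits passing an E-value threshold.

    hits: iterable of (query, subject, evalue) tuples.
    Returns a set of sorted (protein_a, protein_b) pairs.
    """
    best = {}  # (query, subject genome) -> (subject, evalue)
    for query, subject, evalue in hits:
        if query == subject or evalue > max_evalue:
            continue
        key = (query, genome_of(subject))
        if key not in best or evalue < best[key][1]:
            best[key] = (subject, evalue)

    pairs = set()
    for (query, _), (subject, _) in best.items():
        # Keep the pair only if the subject's best hit in the query's genome
        # points back at the query.
        back = best.get((subject, genome_of(query)))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


hits = [
    ("A|p1", "B|q1", 1e-120), ("B|q1", "A|p1", 1e-118),
    ("A|p2", "B|q2", 1e-30), ("B|q2", "A|p2", 1e-28),
]
permissive = rbbh(hits, 1e-10)
stringent = rbbh(hits, 1e-50)
```

In the toy data, the weaker pair drops out at 1E-50, which is the narrowing of context that the text describes.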

Table 2.1: Gene family and protein processing at different thresholds

          Number of Gene Families                          Number of Proteins
E-value   After MCL  Misclustered  < 4 Sequences  Final    After MCL  Duplicates re-added  Final     Final (GPA only)
1E-10     7733       52            955            6726     6535215    38831                6547127   79719
1E-50     7596       42            1750           5804     3527387    17425                3526199   70274
1E-100    6963       30            2148           4785     1695825    6487                 1694873   56640
1E-150    6203       21            2358           3824     853046     3901                 850674    44081

Several fates were possible for the GPA proteins present in a given cluster at the 1E-10 level and these are graphically summarised in Figure 2.3B. Referring to Figure 2.3B, the following scenarios are possible:

• All the GPA proteins could be present in clusters found at higher stringency thresholds (1).

• The GPA proteins could map to a cluster that contains more GPA proteins than expected, suggesting the coalescence of two or more clusters (2 and 3).

• The GPA proteins could map to a cluster that contains fewer GPA proteins than expected, suggesting either the complete loss of a homologue or splitting of clusters (4 and 5).

Thus, I could select a cluster at an appropriate threshold as long as the fate of the GPA proteins could be mapped with fidelity (ticked in Figure 2.3B). Additionally, the total number of sequences at that stringency level had to be greater than a semi-arbitrary cut-off I chose as sufficient for HGT inference downstream (200 protein sequences). Interestingly, there was a single case where a gene family clustering at a higher stringency contained more GPA proteins but those proteins were not present at the 1E-10 level in any gene family. These proteins had been removed at 1E-10 for belonging to gene families with misclustered paralogues. However, at the higher stringency these in-paralogues were no longer misclustered, leading to an odd scenario where a more stringent cluster had GPA diversity unrepresented at the base level. To maximise overall GPA diversity, this edge case was allowed - resulting in an overall increase of GPA proteins in the final set of gene families (79719 to 79741). Applying this gene family reduction approach substantially decreased the total number of proteins in the dataset (from 6547127 to 3789584, i.e. to 58% of the original; Figure 2.3C) whilst maintaining the same number of clusters, GPA diversity, and minimum context for downstream HGT inference.
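One way to express the per-family threshold selection is sketched below. It is a deliberately simplified reading of the approach: it assumes that only an exact match of the GPA protein set counts as "mapped with fidelity", that the most stringent qualifying threshold is preferred, and that the 200-sequence cut-off is inclusive. The function name and data layout are hypothetical.

```python
MIN_FAMILY_SIZE = 200  # semi-arbitrary cut-off quoted in the text


def choose_threshold(base_gpa, candidates, min_size=MIN_FAMILY_SIZE):
    """Pick a clustering threshold for one 1E-10 gene family.

    base_gpa: set of GPA protein IDs in the 1E-10 gene family.
    candidates: list of (threshold_label, gpa_ids, n_sequences) for the
        stricter clusters that the GPA proteins map to, ordered from most
        to least stringent (hypothetical layout).
    Falls back to the base threshold when no stricter cluster qualifies.
    """
    for label, gpa_ids, n_sequences in candidates:
        same_gpa_content = set(gpa_ids) == set(base_gpa)
        if same_gpa_content and n_sequences >= min_size:
            return label
    return "1E-10"


family = {"g1", "g2"}
options = [
    ("1E-150", {"g1"}, 500),        # GPA protein lost: rejected
    ("1E-100", {"g1", "g2"}, 150),  # too few sequences: rejected
    ("1E-50", {"g1", "g2"}, 900),   # accepted
]
chosen = choose_threshold(family, options)
```

The coalescence and splitting scenarios in Figure 2.3B would need additional handling that is not attempted here.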

Figure 2.3: Gene family size reduction. (A) The number of sequences across all predicted gene families at the most permissive threshold, 1E-10. (B) Gene family reduction approach. Each row represents a single gene family, and how the number of GPA sequences included in the family changes across clustering at different thresholds (columns). Red arrows show connections disallowed by my selection criteria, and ticks indicate the chosen stringency for each gene family. Refer to the text for more detail. (C) The impact of the reduction approach on the number of sequences in gene families (orange line) compared to the clusters at the base threshold (black line, as in A). Dashed red line indicates the minimum sequences in each gene family, below which further family reduction was not performed (N = 200). (D) The impact of gene family reduction on the variation of sequence length within gene families.

2.3.4 Gene families to gene trees

Protein length heterogeneity

As described in Section 2.1.2, it was my hope that the above reduction approach would reduce sequence heterogeneity within individual gene families to improve alignment, removing the need for arbitrary sequence identity or length cut-offs. To see whether this was indeed the case, I calculated the coefficient of variation (CV) for the protein sequence lengths in each gene family - both for the non-reduced gene families clustered at 1E-10 and the reduced set derived above. The flatter, broader distribution of the CV values seen for the reduced gene families in Figure 2.3D demonstrates that gene families in the reduced set are composed of more homogeneously sized sequences. This is summarised by the significantly lower mean CV in the reduced set (p < 2.2e-16, Wilcoxon test). Interestingly, several gene families with the largest CV values (>150%) are identified as adhesins (including filamentous hemagglutinins). These can be long, multi-domain, polymeric peptides (e.g. see Table 3 in Berne et al. 2015) and so significant heterogeneity in sequence length is expected.
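For reference, the CV of sequence lengths in a gene family is the standard deviation divided by the mean, expressed here as a percentage. The sketch below assumes the sample standard deviation; the text does not state which estimator was used, and the function name is hypothetical.

```python
from statistics import mean, stdev


def length_cv(lengths):
    """Coefficient of variation (%) of protein lengths in one gene family."""
    # stdev() is the sample standard deviation (an assumption, see lead-in).
    return 100.0 * stdev(lengths) / mean(lengths)


uniform_family = length_cv([300, 300, 300, 300])
mixed_family = length_cv([90, 110])
```

A family of identically sized proteins has a CV of zero, while values above 100% indicate that the spread of lengths exceeds the mean length.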

FastTree gene trees

Sequences were aligned using Clustal Omega. Of 6726 input gene families, three failed to align within a 72-hour walltime - a practical cut-off for the HPC resource used to produce the alignments (see Methods, Section 2.2.3). These three gene families were discarded from further analysis. Gene trees for all alignments were reconstructed with FastTree (see Methods, Section 2.2.4). The accuracy of the latter in the context of HGT inference in this work is critically evaluated below (Section 2.3.9).

2.3.5 Reference tree reconstruction and local correction

A priori HGT filter

Although coalescent-based methods are resistant to the presence of HGT signal, I wanted to reduce the amount of HGT signal being fed into the algorithm. How can this be done without a priori HGT inference? I found that 1824 of the 6723 gene trees contained at most one sequence per taxon (no paralogues) and in turn represented 5070 of the 5073 taxa present in the entire gene tree set. I reasoned that although duplicate copies of a gene may be the result of non-HGT forces, paralogy is often associated with the most promiscuous gene classes (such as transposases). By removing the most HGT-rich gene classes, I expected to increase the ratio of vertical:HGT signal in the gene trees from which the species tree was assembled. Limiting the input gene tree set came at a cost of three taxa: “Candidatus Tremblaya princeps” (strains PCIT and PCVAL) and Mycoplasma wenyonii str. Massachusetts. These gene trees were then fed into the ASTRAL-II software, producing a tree with 5070 tips (see Methods, Section 2.2.4). This reference reconstruction was then midpoint-rooted and rendered into a cladogram. In this representation, all terminal branches are aligned but branch lengths do not correspond to any measure of divergence. This cladogram mimics the expected ultrametric tree, required by the Mowgli reconciliation algorithm, and represents an acceptable alternative to reconstructing a pan-bacterial time-resolved phylogeny, which was outside the scope of this work. I account for some error associated with using a cladogram as a reference tree in Section 2.3.6. Note that, for our purposes, the precise position of the root is not important as long as it falls outside the Bacillaceae.
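The selection of paralogue-free gene trees can be sketched as below. The input layout (a mapping from gene tree ID to the taxon label of each leaf) and the function name are hypothetical; parsing of the actual tree files is not shown.

```python
from collections import Counter


def single_copy_trees(tree_taxa):
    """Keep gene trees with at most one sequence per taxon.

    tree_taxa: dict mapping tree ID -> list of taxon labels, one per leaf.
    Returns (kept_trees, covered_taxa).
    """
    kept = {}
    for tree_id, taxa in tree_taxa.items():
        # A taxon appearing more than once implies a paralogue in the tree.
        if max(Counter(taxa).values()) == 1:
            kept[tree_id] = taxa
    covered = set().union(*kept.values()) if kept else set()
    return kept, covered


trees = {
    "fam1": ["A", "B", "C", "D"],
    "fam2": ["A", "A", "B", "E"],  # taxon A is duplicated
    "fam3": ["B", "C", "D", "F"],
}
kept, covered = single_copy_trees(trees)
missing = set().union(*trees.values()) - covered
```

Comparing `covered` with the full taxon set identifies taxa lost by the filter, analogous to the three taxa dropped from the reference tree.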

Importance of context - the local group

At this stage it was important to ensure that GPA and closely related species (herein “local group”, operationally defined as the Bacillaceae) were correctly clustered together in the reference tree. If the local group is incorrectly resolved in the reference tree, gene trees with truly vertical histories into GPA are likely to give spurious HGT signals. The necessity for reference tree accuracy decreases with increasing distance from GPA - as long as a vertical history can be rejected, the accurate position of a putative HGT donor outside the local group is not crucial. It must be stressed here that this compromise is only applicable in the context of detecting HGT into GPA or any other single group. Inference of evolutionary history between other groups would likely be error-prone. To summarise, by ensuring the accuracy of the GPA and local group phylogenies:

• A vertical history of GPA genes, meaning they were inherited from the latest common ancestor of GPA and other Bacillaceae, can be accepted or rejected with accuracy

• The branch of HGT gain into GPA can be predicted with accuracy

• The identity of the HGT donor branch is less likely to be accurate

• Accuracy of predicted evolutionary events not into GPA will be poor

Correction using a concatenation-based method

To confirm and, if necessary, refine the topology of the local group I reconstructed a species tree for the Bacillaceae family (which includes GPA, see Figure 1.1) using an orthogonal concatenation-supermatrix approach. I selected all proteomes from within my dataset that taxonomically corresponded to the Bacillaceae (N = 215, including all 25 GPA). I performed an all-versus-all BLAST followed by RBBH identification (including in-paralogue identification for each proteome) and MCL clustering at two E-value thresholds (1E-10 and 1E-50). At the 1E-10 level, I identified 30904 gene families comprising 924282 proteins which included 82054 (of the 83468; 98.3%) GPA proteins. After removing gene families containing misclustered paralogues (N = 301) and those with fewer than four protein sequences (N = 10761), I identified all gene families that comprised one-to-one orthologues (N = 229). By requiring each taxon to be represented by exactly one protein, the hope was to select highly conserved (perhaps essential) genes and so maximise vertical signal. Although the presence of some small amount of HGT cannot be ruled out, it should be overwhelmed by the vertical signal. At the 1E-50 threshold, 158 one-to-one orthologous gene families were identified. By reconstructing the species phylogeny at a more stringent E-value threshold, I concentrate on more highly conserved gene families (which are less likely to contain HGT) at the cost of a reduced amount of input signal.

The gene families were individually aligned and the alignments concatenated into a supermatrix. Species phylogenies were reconstructed using both RAxML and FastTree. On comparison, I found that at each threshold the RAxML and FastTree reconstructions were identical. Further, I found that the 1E-10 and 1E-50 phylogenies were highly congruent. The only difference observed in the GPA phylogeny between the ASTRAL, 1E-10 and 1E-50 reconstructions was in the exact position of Geobacillus sp. GHH01 (Figure 2.4). To resolve the discrepancy, I reconstructed yet another GPA phylogeny (using a concatenation-based approach) that included a broader set of genomes (including those without fully assembled genomes, and so excluded from my dataset). This more comprehensive phylogeny strongly supported G. sp. GHH01 as an outgroup to both the G. kaustophilus and Geobacillus sp. C56-T3 subclades - the same branching pattern observed in the 1E-50 Bacillaceae reconstruction. On the basis of this, the GPA phylogeny in the ASTRAL-II reference tree was corrected to be consistent with the 1E-50 reconstruction.

Figure 2.4: Reference tree refinement. Comparison of GPA phylogenies reconstructed using a coalescent-based approach (ASTRAL) and a concatenation-based approach (1E-10, 1E-50). Phylogenies shown are pruned from broader reconstructions: for ASTRAL, pruned from the 5070-tip reference tree; for 1E-10 and 1E-50, pruned from 215-tip Bacillaceae trees. The only discrepancy between reconstructions is in the position of Geobacillus sp. GHH01, highlighted in red. The 1E-50 phylogeny was selected as representative, and the ASTRAL reference tree was refined to correspond to this topology (see text).

Beyond the discrepancy in the position of G. sp. GHH01, the ASTRAL-II and concatenation-derived Bacillaceae phylogenies were remarkably consistent. Most other differences were limited to the exact branching order of closely related strains of the same species and so did not necessitate further correction.
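The concatenation step used for the Bacillaceae species tree can be sketched as below. This is an illustration only, assuming each one-to-one orthologous family has already been aligned and is held as a hypothetical mapping from taxon to aligned sequence; the alignment and tree-building programs themselves are not invoked.

```python
def concatenate(alignments):
    """Join per-family alignments into one supermatrix row per taxon.

    alignments: list of dicts, each mapping taxon -> aligned sequence.
    Every taxon must appear exactly once in every alignment, as expected
    for one-to-one orthologous gene families.
    """
    taxa = sorted(alignments[0])
    supermatrix = {taxon: "" for taxon in taxa}
    for alignment in alignments:
        if sorted(alignment) != taxa:
            raise ValueError("each taxon must appear exactly once per family")
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        for taxon in taxa:
            supermatrix[taxon] += alignment[taxon]
    return supermatrix


family_a = {"taxon1": "MK-L", "taxon2": "MKAL"}
family_b = {"taxon1": "GG", "taxon2": "G-"}
matrix = concatenate([family_a, family_b])
```

Because every taxon contributes one sequence to every family, no gap padding for missing taxa is needed in this sketch.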

2.3.6 HGT inference with Mowgli

In this section I discuss my approach for HGT detection using the reconciliation program Mowgli (MowgliNNI; Doyon et al. 2010). A significant consideration in choosing Mowgli was the implementation of the NNI algorithm capable of rearranging poorly supported gene tree branches to minimise reconciliation costs during the calculation - thereby reducing spurious HGT prediction (Nguyen et al. 2012, 2013).

Gene tree preparation

To conform to Mowgli input requirements, I first resolved all multichotomies present in the gene trees into dichotomies (strict bifurcation) in a random manner. In 11 of the 6723 gene trees all branch lengths were 0 - i.e. all the underlying sequences shared 100% identity. In such cases no relative relationship can be inferred and these trees were discarded from further analysis. Finally, the remaining 6712 gene trees were midpoint rooted.
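Random resolution of multichotomies can be sketched on a toy tree representation as below. The nested-tuple encoding (a leaf is a string, an internal node is a tuple of children) is a hypothetical simplification that ignores branch lengths and support values; it is not the representation or software used in this work.

```python
import random


def resolve(node, rng):
    """Randomly resolve multichotomies so every internal node has two children."""
    if isinstance(node, str):
        return node
    children = [resolve(child, rng) for child in node]
    while len(children) > 2:
        # Join two randomly chosen children under a new internal node.
        i, j = sorted(rng.sample(range(len(children)), 2))
        right = children.pop(j)  # pop the higher index first
        left = children.pop(i)
        children.append((left, right))
    return tuple(children)


def leaves(node):
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def is_binary(node):
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)


tree = ("a", "b", ("c", "d", "e"), "f")
resolved = resolve(tree, random.Random(0))
```

Passing a seeded `random.Random` instance makes the random resolution reproducible between runs.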

Reconciliation and processing

I performed reconciliations at a range of transfer (T) costs to modulate HGT detection: T = 3, 4, 5 and 6 where duplication (D) = 2 and loss (L) = 1. Hereafter, the cost profile T = 3, D = 2, L = 1 is referred to simply as T3, and so forth for increasing transfer costs. Each gene tree was submitted for reconciliation at each transfer cost, thus a total of 4 × 6712 = 26848 reconciliations were attempted. I ran a similar reconciliation cost profile when determining HGT in the old genomic dataset (see Section 2.3.1), but extended to include T = 8, 10, 20 to see whether HGT detection at what are probably unrealistically high costs qualitatively altered the results. I discuss this briefly in the context of HT gene functional enrichment in Chapter 3 (Section 3.3.1).

In the case of some of the largest gene trees, reconciliations at particular parameters could not complete within the 72 hours of walltime that limited practical scaling on the HPC cluster (T3 = 152; T4 = 164; T5 = 204; T6 = 122). In the majority of cases, these gene families were discarded from further analysis. In a handful of cases, when a gene family was successfully reconciled at some parameters (e.g. successful at T3 and T4, failed at T5 and T6), the successful reconciliations were included in downstream processing as far as the filtering criteria would allow (see Section 2.3.7).

I processed the reconciliation output by selecting only those events pertaining to GPA evolutionary history. In the first instance I classified each gene family into one of three categories (also see Table 2.2):

1. No HGT event predicted into any GPA branches from non-GPA donors

2. One or more HGT events predicted into GPA branches from non-GPA donors

3. Gene origin predicted to be from GPA (gene tree rooted in GPA)

Table 2.2: Raw GPA gene history predictions by Mowgli

                                          Evolutionary History of GPA in Gene Family
Transfer Cost   Input trees   Reconciled   HGT-free   HGT from non-GPA donor   Root within GPA
3               6712          6560         2389       4171                     853
4               6712          6548         2679       3869                     805
5               6712          6508         3007       3501                     750
6               6712          6590         3381       3209                     687

I make the distinction between HGT events that were between two GPA branches or from non-GPA donors because the former represents the exchange of genes already extant in the GPA lineage. Thus, GPA-to-GPA transfers simply carry forward the initial origin of the gene within the GPA lineage - whether that is from vertical inheritance or derived from an HGT with a non-GPA donor (Figure 2.5A).

In situations where a gene was predicted to originate from within the GPA clade (category 3 above) a vertical gene history was impossible to ascertain. In addition, with the tree centred on GPA, most HGT events were predicted to go from, rather than into, GPA. However, in some of these gene families HGT events into GPA were identified (e.g. 60 of 853 at T3) - in situations where multiple GPA subclades were present in the gene tree and the root was within one of these subclades. Nevertheless, having the gene history rooted within GPA was a sign of poor context, and these HGT events were not counted beyond the first filtering step (Section 2.3.7 and Table 2.3).

HGT recipient branch correction

Mowgli’s time-consistent reconciliation method (transfers must occur between temporally overlapping branches) results in an interesting, but uncommon, occurrence. Occasionally, a transfer will be predicted into GPA at a more ancestral branch than might seem parsimonious based on the distribution of extant taxa (solid arrow, Figure 2.5B) - essentially Mowgli is incurring extra costs by including seemingly unnecessary losses. However, the reference tree is not time-resolved. In other words, the branch lengths do not correspond to evolutionary time and as such Mowgli’s attempts to predict time-consistent transfers are misguided in the framework of this analysis. Optimally, I would provide a reference tree that was accurate and dated in its entirety - but the task of reconstructing such a phylogeny was beyond the scope of this project. To ensure that ancient transfers are not overestimated due to this effect, I applied a branch reduction algorithm in my processing of the Mowgli output whereby the branch of HGT gain was reduced to the most parsimonious depth (dashed arrow, Figure 2.5B).

Finally, in the context of predicting branches at which HGTs were gained: as mentioned above, the lengths of branches in the reference tree do not correspond to the true evolutionary distances between nodes in the GPA phylogeny. As such, there is no a priori expectation that longer evolutionary distances should accumulate more transfers - at least with respect to this method of reconciliation. I discuss the temporal dynamics of HGT gene gain in Chapter 3 (Section 3.3.3).

Figure 2.5: GPA-to-GPA transfers and refining the recipient branch. (A) GPA-to-GPA transfers carry forward the original source of the gene. In gene tree A the gene was vertically inherited into GPA, followed by a loss (red cross) and subsequent re-acquisition (green arrow) of the gene in a GPA subclade. The gene in the recipient taxa is still considered vertically inherited. In gene tree B, the original source of the gene in GPA was HGT from a non-GPA donor, so genes from subsequent GPA-to-GPA transfers are also considered to be derived from a non-GPA donor. (B) Illustration of recipient branch correction (dashed green arrow) if the reconciliation predicts a non-parsimonious recipient branch (solid green arrow) due to the time-consistency requirement built into the Mowgli algorithm.
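The recipient branch correction can be sketched as below, under the assumption that the "most parsimonious depth" is the branch leading to the most recent common ancestor of the extant taxa that carry genes derived from the transfer. That interpretation, the child-to-parent encoding of the reference tree and all names are hypothetical; the actual algorithm is not described in more detail in the text.

```python
def ancestors(node, parent):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def most_parsimonious_branch(taxa, parent):
    """Deepest node whose subtending branch still covers all recipient taxa.

    taxa: list of extant taxa holding genes derived from the transfer.
    parent: dict mapping child node -> parent node in the reference tree.
    Each branch is identified by the node at its lower (child) end.
    """
    paths = [ancestors(taxon, parent) for taxon in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on any root-ward path is the common ancestor.
    for node in paths[0]:
        if node in common:
            return node


# Toy reference tree: root -> (X -> (a, b), c)
parent = {"a": "X", "b": "X", "X": "root", "c": "root"}
corrected = most_parsimonious_branch(["a", "b"], parent)
```

In the toy tree, a transfer predicted at the root branch but inherited only by `a` and `b` would be moved down to the branch subtending `X`.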

2.3.7 Refining HGT and vertical gene history predictions

In this section I discuss how calculating reconciliations at a range of transfer costs allowed me to refine HGT predictions to increase stringency. The multiple steps of filtering account for both the non-deterministic nature of the Mowgli algorithm and potential error as a result of poor phylogenetic resolution between closely related taxa. By having predictions available at a range of stringencies, I later evaluate to what extent the qualitative patterns of HGT (and to some extent vertical) predictions remain consistent (Chapter 3).

Constancy of prediction

A naive expectation of reconciliation at two transfer costs (e.g. T3 and T4) is that a gene family predicted to have an HGT into GPA at T4 (less likely for transfer to be predicted) should also have an HGT predicted into GPA at T3. In terms of an HGT-free gene history, the likelihood applies in reverse. This test for constancy of either an HGT-free or HGT-containing gene history was done for each gene family, without considering the nature of the underlying transfer events (if present). Thus, for gene families with HGTs predicted at T4 only those with HGTs predicted at T3 were carried forward. For HGT at T5, the filter required ‘constancy’ at T4 and T3. At T3 there was no more permissive transfer cost for comparison so all HGT gene families were carried forward. The same approach was taken for gene families with exclusively vertical gene histories, e.g. gene families with a vertical signal at T5 had to also have an HGT free reconciliation at T6 (most restrictive for HGT and so most permissive for vertical). The results of this and further filtering steps are presented in Table 2.3.
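The constancy test can be sketched as below. The per-family `calls` mapping and the function name are hypothetical, and a missing call (for example a failed reconciliation) is simply treated as a failure of the filter, which may be stricter than the handling used in this work.

```python
COSTS = [3, 4, 5, 6]


def passes_constancy(calls, cost, costs=COSTS):
    """Check one gene family at one transfer cost.

    calls: dict mapping transfer cost -> "HGT" or "vertical".
    An HGT call must also be HGT at every lower (more permissive) cost;
    a vertical call must also be vertical at every higher cost.
    """
    call = calls.get(cost)
    if call == "HGT":
        required = [c for c in costs if c < cost]
    elif call == "vertical":
        required = [c for c in costs if c > cost]
    else:
        return False
    return all(calls.get(c) == call for c in required)


stable = {3: "HGT", 4: "HGT", 5: "vertical", 6: "vertical"}
unstable = {3: "vertical", 4: "HGT", 5: "HGT", 6: "vertical"}
stable_results = [passes_constancy(stable, c) for c in COSTS]
unstable_results = [passes_constancy(unstable, c) for c in COSTS]
```

In the second toy family only the T6 vertical call passes, because T6 has no more restrictive cost to compare against.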

Consistency of prediction

In the second stage of filtering I used the individual transfer events predicted into GPA and consistency was evaluated in two steps. First, I differentiate between transfers arising from either outside or within the local group - ‘long’ and ‘short’ distance HGTs (lHGTs and sHGTs, respectively). Second, I ensure that transfer events are consistently predicted as either lHGT or sHGT across transfer costs. Similarly, I also require that the branch of gene gain within the GPA remains consistent across costs.

Each transfer event has predicted donor and recipient branches, and I alluded to the potential for error in donor prediction given possible inaccuracy in the reference tree outside of the local group in Section 2.3.5. However, a distinction can still be made with confidence between transfers into GPA that originate from either within or outside the local group. I make the distinction between these two HGT sources for two reasons. First, HGTs predicted from closely related taxa are more likely to be the spurious result of poor gene tree reconstruction and so represent a less reliable dataset. Indeed, I find that the number of sHGTs predicted with increasing transfer cost declines more rapidly than lHGTs (Table 2.3), presumably because alternative multiple-loss scenarios become more parsimonious than short distance transfers. Second, valid sHGTs may have different integration biases due to more similar genome signatures, and so could provide a valuable point of comparison downstream.

Table 2.3: Filtering reconciliation results

           Input              Filtering                             Post-filter Output
Transfer   Gene Families      Constancy          Consistency        Events             Gene Families   Genes
Cost       Vertical   HGT     Vertical   HGT     lHGT     sHGT      lHGT     sHGT      Vertical        lHGT     sHGT     Vertical
3          2389       4171†   1398       4111†   2893     2468      2689     2308      1280            13191    14230    27375
4          2679       3869    1768       3629    2396     1769      2219     1661      1615            9791     8998     32102
5          3007       3501    2231       3170    2096     1220      1936     1137      2044            7737     4820     38397
6          3381       3209    2724       2802    1887     907       1749     847       2491            6556     3020     44303

† Reduction due to removal of category 3 gene trees (see Section 2.3.6).

For a transfer to be considered an sHGT, it had to be predicted to originate from a branch within the Bacillaceae portion of the reference tree. The Bacillaceae, as expected (Zhang and Lu 2015), were not monophyletic within the reference tree and transfers predicted from the non-Bacillaceae groups that subtended the Bacillaceae most recent common ancestor (total of 21 taxa and 20 internal branches) were considered lHGTs.

Thus, for each transfer event in each gene family I considered the branch of gene gain and the genes derived from that transfer, as well as the sHGT or lHGT origin of the transfer. For the event to pass the filter at a given transfer cost (e.g. T4), the transfer must have occurred into the same GPA branch, led to the same genes in the same taxa, and have the same lHGT/sHGT origin at a lower transfer cost (T3). As for the previous filter, for a cost of T5 consistency was expected across T4 and T3, and so forth. Finally, I conservatively define vertical genes as those derived from gene families lacking any HGTs (in the above filtering step) and so this set of gene families does not require filtering at this step.
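The event-level consistency filter can be sketched as below. Each event is represented by a hypothetical tuple of recipient branch, the set of derived genes, and its lHGT/sHGT origin; an event at a given cost is kept only if an identical tuple is present at every lower cost. The names and data layout are illustrative assumptions.

```python
COSTS = [3, 4, 5, 6]


def consistent_events(events_by_cost, cost, costs=COSTS):
    """Events at `cost` that are predicted identically at all lower costs.

    events_by_cost: dict mapping transfer cost -> set of
        (recipient_branch, frozenset_of_genes, origin) tuples.
    """
    kept = set(events_by_cost.get(cost, set()))
    for lower in costs:
        if lower < cost:
            kept &= events_by_cost.get(lower, set())
    return kept


event_1 = ("branch_1", frozenset({"g1", "g2"}), "lHGT")
event_2 = ("branch_2", frozenset({"g3"}), "sHGT")
event_2_relabelled = ("branch_2", frozenset({"g3"}), "lHGT")

events = {
    3: {event_1, event_2_relabelled},
    4: {event_1, event_2},
    5: {event_1},
}
kept_t4 = consistent_events(events, 4)
```

In the toy data the second event is dropped at T4 because its origin switches between lHGT and sHGT across costs, even though the recipient branch and genes are unchanged.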

Filtering for plasmid-borne genes

Up to this stage, I have made no distinction (except in initial labelling) between chromosomal and plasmid-borne genes. However, for downstream spatial analysis there was an obvious need to exclude plasmid-borne genes. Here, I removed any HGT events or gene families (for the vertical dataset) that were either fully or partially comprised of plasmid genes (see Table 2.3, under “Post-filter Output”).

All other genes - the “grey zone”

With multiple stages of filtering, a vertical or HT gene history cannot be stringently predicted for many GPA genes. However, this large group of genes with uncertain histories still lie on the GPA chromosomes and form an important point of comparison. As such, they are pooled together into a set unto themselves, referred to as the “grey zone”. For much of the rest of this work, sHGT genes are also conservatively pooled into the grey zone except where lHGT and sHGT genes are explicitly compared.

2.3.8 Determining the age of transfer

Armed with a fully refined set of predicted HGT events into GPA, I wanted to make a distinction between older and more recent transfer events. This was motivated by an expectation that recent and ancient acquisitions may not only have different genomic signatures (with recent HT genes not yet ameliorated to the host), but also different spatial biases. To determine whether HT gene location, if at all biased, was enforced immediately on arrival or over long-term selection I needed to determine a set of HGTs that were gained very recently in evolutionary time.

Since HGT predictions were already resolved by branch, I chose to annotate each branch as either “old” or “recent”. A recent reassessment of the Geobacillus genus identified a set of 16 genomospecies - sets of genomes, with potentially diverging nomenclature, that share sufficient identity to be considered the same species (Aliyu et al. 2016). I used this classification as a proxy for annotating branches in the phylogeny to represent either recent or ancient lineage events: branches separating genomes within a genomospecies were considered recent, branches separating genomospecies were considered ancient. Thus, HGTs predicted to have been gained at the recent branches were labelled as recent HGTs (light green in Figure 2.6). Where a genome not included in Aliyu et al. fell within an otherwise clear genomospecies group (Geobacillus stearothermophilus 10 and Group I1) it was included in that group. For other genomes not covered by Aliyu and co-authors (red labels in Figure 2.6), I assigned new genomospecies groups. For these, if the distance separating two or more genomes was less than the distance separating taxa within any Aliyu genomospecies set, they were classified as the same genomospecies (A1 and A3, Figure 2.6). Finally, in cases where a long terminal branch leads to a genome (e.g. Anoxybacillus amylolyticus) there would be some proportion of transfers into that branch that would also be recent; however, with no way to deconvolute these, the entire branch was conservatively considered ancient. The numbers of HGT genes split by age of acquisition are given in Table 2.4.
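The branch annotation rule can be sketched as below. This reflects one reading of the text: a branch is "recent" only if all genomes beneath it belong to one genomospecies and it does not subtend every member of that genomospecies in the dataset, so that the stem of a genomospecies and the terminal branch of a lone representative are both "old". The function name, group labels and genome labels are hypothetical.

```python
def branch_age(descendant_tips, group_of):
    """Classify a reference tree branch as "recent" or "old".

    descendant_tips: genomes found beneath the branch.
    group_of: dict mapping genome -> genomospecies label.
    """
    tips = set(descendant_tips)
    groups = {group_of[tip] for tip in tips}
    if len(groups) != 1:
        return "old"  # the branch separates different genomospecies
    (group,) = groups
    members = {genome for genome, label in group_of.items() if label == group}
    # A strict subset means the branch lies within the genomospecies.
    return "recent" if tips < members else "old"


groups = {"genome1": "I1", "genome2": "I1", "genome3": "I1", "genome4": "A1"}
within_group = branch_age(["genome1", "genome2"], groups)
group_stem = branch_age(["genome1", "genome2", "genome3"], groups)
lone_genome = branch_age(["genome4"], groups)
```

The lone-genome case mirrors the conservative treatment of long terminal branches described above.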

Table 2.4: Classifying old and recent HGT events

                lHGT genes         sHGT genes
Transfer Cost   Old      Recent    Old      Recent
3               11382    1809      12483    1747
4               8472     1319      7812     1186
5               6840     897       4052     768
6               5953     603       2594     426

This approach presents a tractable method for determining some proportion of HGTs as more recent: those acquired during a short period of lineage evolution that has not yet resulted in speciation. Although I continue to refer to the two sets as “old” and “recent” through the rest of the manuscript, the “old” set must contain some proportion of recent HGTs that are not detectable because only one member of the genomospecies is included in this work.

Figure 2.6: GPA assignment to genomospecies groups. GPA phylogeny is pruned from the Bacillaceae species tree reconstructed at 1E-50, see Section 2.3.5. Branches are coloured according to HGT age, old (blue) and recent (green). Bracketed labels correspond to genomospecies group: black - defined in Aliyu et al. 2016, red - in this work.

2.3.9 RAxML vs FastTree

I left the comparison of RAxML and FastTree approaches for reconstructing gene trees until the end of this chapter because the evaluation depends on the nomenclature of HGT detection and filtering, introduced above. This comparison was performed with the previous dataset (see Section 2.3.1) which was processed in an identical manner to detect HGTs, except that all gene trees were built with RAxML. Given my focus on deriving a stringent set of HGT events, I wondered to what extent a less accurate (but much faster) tree building algorithm would influence downstream results (post-filtering). The results outlined here provided the rationale for using FastTree on the current dataset. I selected 250 gene families predicted to have exclusively vertical gene his- tories into GPA across costs T = 3, 4, 5 and 6 (most stringent vertical prediction) and 250 lHGT events predicted consistently for transfer cost 4 (T4). Next, I recon- structed the gene trees in FastTree from the same alignments as were used for the RAxML calculation. These gene trees were reconciled and filtered in the same way as before for transfer costs T = 3, 4, 5 and 6. Finally, I compared to what extent RAxML-predicted lHGT / vertical histories are recovered in their entirety in the FastTree pipeline. Critically, I wanted to observe the conversion rate from lHGT predictions to vertical predictions, and vice versa, because a loss to the grey zone would not qualitatively affect downstream analysis. I found that of the 250 lHGT events, 207 were recovered entirely in the equivalently filtered FastTree prediction pipeline. Of the remaining, only 2 previ- ously lHGT-containing gene families (<1%) now had a vertical prediction (Figure 2.7A). Similarly, of the 250 vertical gene families, 202 were recovered as such in the FastTree analysis with only four lHGT events arising (Figure 2.7B). 
I considered this margin of error to be well within acceptable levels given the reduction in the computational time required for gene tree reconstruction, from months with RAxML to days with FastTree.
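The conversion rates quoted above follow directly from the reported counts. The short Python sketch below redoes that arithmetic; the function and variable names are hypothetical, and the grey-zone counts are derived here by subtraction, on the assumption that every prediction that was neither recovered nor converted fell into the grey zone.

```python
def outcome_fractions(total, recovered, converted):
    """Split a benchmark set into recovered, converted and grey-zone fractions.

    total     -- number of RAxML-based predictions that were re-tested
    recovered -- predictions recovered unchanged in the FastTree pipeline
    converted -- predictions that flipped to the opposite class
    Assumption: everything else was lost to the grey zone.
    """
    grey = total - recovered - converted
    return {
        "recovered": recovered / total,
        "converted": converted / total,
        "grey_zone": grey / total,
    }


# Counts as reported in the text for the two benchmark sets.
lhgt = outcome_fractions(total=250, recovered=207, converted=2)
vertical = outcome_fractions(total=250, recovered=202, converted=4)

for name, result in (("lHGT", lhgt), ("vertical", vertical)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Under these assumptions, under 2% of predictions in either set switch class outright; most of the unrecovered predictions move to the grey zone.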


Figure 2.7: Comparison of RAxML and FastTree methods. (A) FastTree-based (FT) HGT inference of 250 lHGT events predicted using RAxML. (B) FT-based HGT inference of 250 vertical gene families predicted using RAxML.


2.4 Discussion

In this chapter I have defined and tested a pipeline for stringent HGT identification into the GPA group. I have described how I augmented established orthology-based methods (e.g. OrthoMCL; Li, Stoeckert, and Roos (2003)) to detect HGT into desired model groups, increasing speed without compromising accuracy. Specifically, by sampling homology relationships at different similarity thresholds I have shown that gene family size can be significantly reduced without compromising context for downstream HGT inference. This results in fewer sequences needing to be phylogenetically reconstructed and reconciled, alongside a decrease in the overall sequence length heterogeneity. Next, I showed that a comprehensive reference tree, made using ASTRAL-II, can be corrected using orthogonal concatenation-based approaches. Correction is only critical for the phylogeny of species closely related to GPA and was, for the large part, unnecessary here due to the accuracy of the ASTRAL-II reconstruction. Further, I showed that HGT predictions, derived across multiple cost profiles, can be filtered in a number of steps in an attempt to reduce error from both the nondeterministic reconciliation algorithm and uncertainty in phylogeny reconstruction. The success of the latter is shown in the comparison of HGT predictions derived from RAxML and FastTree gene trees where I found only a small number of transitions between HGT and vertical assignments. The approaches I have developed in this chapter could be easily extended to detect HGT into other groups of interest, and so could prove to be useful in the broader field. The result of this chapter is a set of HT and vertical gene predictions (HT genes are defined as those derived from lHGT events). However, the stringency of these is dependent on the transfer cost supplied to the reconciliation - especially since the filtering steps require consistency across multiple costs. 
So, confidence in prediction is highest at T6 and lowest at T3, but increasing stringency comes at the cost of a reduced number of genes. In the following chapter, I explore the metrics of my gene history predictions at different costs to help select an appropriate set.

Chapter 3

Assessment of HGT into GPA


3.1 Introduction

In this chapter I investigate the non-spatial characteristics of the HGT and vertical GPA gene predictions derived in Chapter 2. Importantly, comparison of functional enrichment and differences in nucleotide content with well-established patterns provides some measure of validation of the dataset against prior expectations. I compare the robustness of the observed patterns across varying prediction stringencies with the aim of finding a suitable threshold that balances the quality of the predicted gene sets with the number of candidates. In this chapter I also explore the temporal dynamics of HGT gain into GPA in an effort to discover whether lineage evolution was shaped by continuous gene influx or by punctuated bursts of gene gain. Finally, I cautiously look at the most prominent donors of HGTs into GPA to explore whether ecology or phylogenetic similarity drives gene flow into this model group.

3.1.1 Function and nucleotide content as hallmarks

As discussed in Chapter 1 (Section 1.2), distinct patterns have been previously observed for genes predicted to have been derived from HGT events - both in terms of the function of their protein product and in terms of nucleotide content. Briefly, genes encoding proteins that do not require an established, tightly controlled co-dependent network to function (such as catabolic enzymes) and genes that can immediately improve chances of survival in a particular environment (such as antibiotic resistance genes) have been found to be transferred more frequently. Conversely, functions refractory to transfer include those involved in translation and the cell cycle and are typically encoded by highly conserved, informational genes. In the context of nucleotide content, given prior observations that recently transferred genes tend to be AT-rich relative to the host genome (e.g. Daubin, Lerat, and Perrière 2003) and that acquired genes ameliorate over time (Hao and Golding 2006; Lawrence and Ochman 1997), there is a reasonable expectation that genes derived from more recent HGT events should have a lower GC content than genes acquired in older transfers.

These apparent hallmarks of HT genes can be used to assess the quality of my HT gene predictions derived in Chapter 2 and justify (or not) my HGT detection approach. Enrichment or depletion in certain gene product functions will give a first indication of whether my gene history assignments match expectations for horizontally transferred and vertically inherited genes. If a pattern is observed, it will be important to evaluate to what extent it is robust to prediction stringency. Since the functional bias of HT genes is consistent in the literature, seemingly regardless of the age at which the HGT events were predicted, I would likewise not expect to see differences in functional profile between recent and old HGT events.

To assess my classification of HGTs as either recent or old, I can look at the nucleotide content of the transferred genes, expecting to find that recently acquired genes have a lower GC content if I successfully partitioned transfers based on age. Furthermore, as the GPA group contains genomes with varying mean GC contents, there is an opportunity to investigate to what extent variation in the GC content of HT genes is dependent on the GC content of the host.

3.1.2 Temporal patterns of HGT into GPA

Having a time-resolved dataset of transfers into GPA provides an excellent opportunity to investigate temporal dynamics. As discussed in Chapter 2 (Section 2.3.6), the reference tree used for HGT inference does not have branch lengths that are scaled with true evolutionary time. As such, there is no prior expectation that the number of HGTs predicted to have been gained at any branch should correlate with the period of evolution that branch represents. This provides the opportunity to investigate to what extent the timing of HGT events is determined by the branch lengths of the reference tree or by true evolutionary distance. Further, if there is a relationship between the number of transfers and evolutionary time, is it linear? That is, is the timing of gene gain best modelled as a continuous process or by punctuated bursts of gain connected by periods of reduced influx?

Finally, it will be interesting to see whether recent HGTs follow the same pattern as the older gains. In the case of older transfers, the observed rate of gain will reflect some inherent rate of gene gain minus the rate of gene loss due to selection. However, recently acquired genes may not yet have been exposed to longer-term purifying selection and so elevated numbers of transfers might be observed.

Since there is no a priori expectation of the nature of HGT timing, this metric cannot be conclusively used to differentiate between the quality of HGT predictions at different stringencies. Nevertheless, it will be interesting to see to what extent any observed patterns remain stable to changes in reconciliation parameters.

3.1.3 Sources of genes transferred into GPA

There are a number of factors that can determine the sources of genic diversity (see Chapter 1, Section 1.2.2). Apart from intrinsic qualities, such as sequence similarity and codon usage, that may improve chances of integration and are likely shared with phylogenetically close organisms, extrinsic qualities such as physical proximity and ecology can also result in highways of gene sharing. Organisms that share the same ecological niche are valuable sources of functional diversity - to copy a neighbour may give an organism its best chance of survival in a limited environment. Alternatively, in mutualistic communities, where survival is dependent on metabolic cascades, gene flow would promote genic equilibrium resulting in greater streamlining of individual genomes. As a result, there is obvious interest in investigating the sources of gene influx into the GPA clade - to what extent is the functional diversity in this ecologically distinct group driven by ecology rather than phylogenetic proximity?


3.2 Methods

3.2.1 COG functional annotation

Gene function was assigned to each orthologous gene family in the dataset using the EggNOG 4.5.1 framework (Huerta-Cepas et al. 2016). Since the gene families (here referring to the reduced set; see Chapter 2, Section 2.3.3) were GPA-centric, functional annotation was only performed for the GPA proteins within each gene family, to reduce computational workload. Thus, GPA proteins were extracted and individually queried using the emapper.py script against the EggNOG bact_50 database hosted as a local server using the hmmpgmd command line tool. Of the 6726 gene families, 776 failed to find any orthologues within the EggNOG database and these are hereafter annotated with the nabla symbol (∇). For gene families in which orthologues were successfully identified in the database, functional annotations were available at different depths. I chose to use the broadest set, the COG annotation, for easy comparison across many gene families. For simplicity, gene families assigned more than one COG class were excluded from further functional analysis (either because individual sequences had multiple annotations, or because multiple GPA sequences within the gene family had different COG annotations). As a result, there were fewer COG annotations than genes in each set (Table 3.1).
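The exclusion rule for ambiguous gene families can be illustrated with a small Python sketch. This is not the code used in the thesis: the input format (one string of COG letters per GPA protein, with an empty string for no hit) and the function name are assumptions made for illustration, as is the choice to ignore unannotated proteins when other members of the family do have an annotation.

```python
NO_FUNCTION = "∇"  # label used in the text for families with no EggNOG orthologue


def family_cog(protein_cogs):
    """Return the single COG class of a gene family, NO_FUNCTION, or None.

    protein_cogs -- hypothetical input: one string of COG letters per GPA
                    protein in the family, e.g. "S" or "KL" (two classes);
                    an empty string means the protein had no hit.
    """
    classes = set()
    for cogs in protein_cogs:
        classes.update(cogs)  # every letter is treated as one COG class
    if not classes:
        return NO_FUNCTION  # no protein in the family was annotated
    if len(classes) > 1:
        return None  # ambiguous family: excluded from functional analysis
    return classes.pop()


print(family_cog(["S", "S", "S"]))  # one consistent class
print(family_cog(["K", "L"]))       # proteins disagree -> None
print(family_cog(["KL"]))           # one protein, two classes -> None
print(family_cog(["", ""]))         # no annotation at all
```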

Table 3.1: Annotating COGs at different transfer costs

| Prediction Type | Transfer Cost | Genes | COGs |
|---|---|---|---|
| HGT | 3 | 13191 | 12590 |
| HGT | 4 | 9791 | 9430 |
| HGT | 5 | 7737 | 7443 |
| HGT | 6 | 6556 | 6348 |
| Vertical | 3 | 27375 | 26906 |

To estimate enrichment in function across each set (e.g. T3, T4), raw counts of COGs were converted to the proportion of all HT genes in that set belonging to each COG category - e.g. 0.312 of HT genes predicted at T4 belonged to COG S (see Table 3.2). Enrichment was calculated for each COG category as:

\[
\text{Enrichment} = \log\left(\frac{\text{proportion of HT genes in COG}}{\text{proportion of vertical genes in COG}}\right)
\]

COG categories with fewer than 50 genes in the HGT set were excluded from the analysis (italicised, Table 3.2).
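As a minimal sketch of the enrichment calculation described above, the Python function below takes per-COG gene counts for an HT set and for the vertical baseline. The names are hypothetical, the example counts are toy values rather than thesis data, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math

MIN_HT_GENES = 50  # categories with fewer HT genes are excluded (see above)


def cog_enrichment(ht_counts, vertical_counts, min_genes=MIN_HT_GENES):
    """Log ratio of HT to vertical gene proportions for each COG category.

    ht_counts, vertical_counts -- dicts mapping COG letter to gene count.
    Proportions are taken over all genes in each set, including genes in
    categories that are later excluded for having too few HT genes.
    Assumption: natural logarithm.
    """
    ht_total = sum(ht_counts.values())
    vertical_total = sum(vertical_counts.values())
    enrichment = {}
    for cog, n_ht in ht_counts.items():
        n_vertical = vertical_counts.get(cog, 0)
        if n_ht < min_genes or n_vertical == 0:
            continue  # too few HT genes, or the ratio is undefined
        ht_prop = n_ht / ht_total
        vertical_prop = n_vertical / vertical_total
        enrichment[cog] = math.log(ht_prop / vertical_prop)
    return enrichment


# Toy counts for illustration only.
toy_ht = {"V": 100, "J": 50, "U": 3}
toy_vertical = {"V": 50, "J": 100, "U": 3}
toy_result = cog_enrichment(toy_ht, toy_vertical)
print(toy_result)
```

A positive value indicates a category over-represented among HT genes relative to the vertical baseline, a negative value indicates depletion, and 0 indicates equal representation.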

3.2.2 Nucleotide content calculation

In a manner identical to how protein sequences were extracted in Chapter 2 (Section 2.2.1), nucleotide sequences for CDS features were extracted into fasta formatted ‘genomes’ from the downloaded gbff-formatted files. Even though only GPA nucleotide content is addressed in this chapter, this was done for every genome in the 5073-genome dataset. Each nucleotide sequence was matched to the corresponding protein entry by the protID and added to the SQLite database. In a handful of cases the nucleotide sequence length was not consistent with three nucleotides per residue of the protein sequence; these corresponded to either predicted sites of ribosomal slippage or very rare introns. GC content was calculated per gene as the proportion of G and C nucleotides using the alphabetFrequency function in the Biostrings R package (Pagès et al. 2017). In order to directly compare and contrast the nucleotide contents of GPA genes derived from genomes with different mean GC contents, I calculated a normalised GC content value per gene. This value was simply calculated as the gene GC content minus the mean genomic GC content. Thus, a normalised GC value < 0 corresponded to a relatively AT-rich gene for that genome, and vice versa.
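The normalisation can be sketched in a few lines of Python. This is an illustration rather than the Biostrings-based code used in the thesis; the function names are hypothetical, and the genome mean is assumed here to be the unweighted mean of the per-gene GC contents.

```python
def gc_content(seq):
    """Proportion of G and C nucleotides in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def normalised_gc(genes):
    """Per-gene GC content minus the mean GC content for one genome.

    genes -- dict mapping a gene identifier to its nucleotide sequence.
    Assumption: the genome mean is the unweighted mean over genes.
    """
    per_gene = {gene_id: gc_content(seq) for gene_id, seq in genes.items()}
    genome_mean = sum(per_gene.values()) / len(per_gene)
    return {gene_id: gc - genome_mean for gene_id, gc in per_gene.items()}


# Toy sequences for illustration only.
toy_genome = {"geneA": "GGCC", "geneB": "ATAT", "geneC": "ATGC"}
toy_normalised = normalised_gc(toy_genome)
print(toy_normalised)
```

In this toy genome the mean GC content is 0.5, so geneB receives a negative value (relatively AT-rich) and geneA a positive one.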


3.2.3 GPA phylogeny and temporal analysis

The GPA phylogeny, as used in the reconciliation (Figure 3.4A), was pruned from the final reference tree. The GPA phylogram, with branch lengths proportional to the number of substitutions per site (Figure 3.4C), was a pruned subtree of the Bacillaceae reference tree assembled using a concatenation-based approach at a 1E-50 threshold in Chapter 2 (Section 2.3.5). Phylogenies were processed in R using functions from the ape R package (Paradis, Claude, and Strimmer 2004). Assignment of HGT events to corresponding branches in the GPA phylogeny was facilitated by using the phylo4 object system, and associated functions, from the phylobase R package (Hackathon 2017). Phylogenies coloured by age of HGT (see Figure 3.4, panels A, C) were plotted using a slightly modified version of the plotBranchbyTrait function from the phytools R package (Revell 2012). Linear correlations and coefficients of determination (R2) and all confidence values were calculated using the rcorr function from the Hmisc R package (Harrell Jr, Charles Dupont, and others. 2018).

3.2.4 HGT donor identification

To investigate the most common sources of HGT as gained by the GPA, I identified the donor branch for each lHGT event predicted at stringency T4 from the processed output of the Mowgli program. I then found all the extant taxa descendant from the donor branch using the descendants function from the phylobase R package. Each extant descendant of the donor branch, represented by a taxid, was queried against the NCBI taxonomic database to identify the scientific family name. The taxonomic identification was done locally, using the nodes.dmp and names.dmp data files sourced from the NCBI ftp server: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/. Where a donor branch led to multiple extant taxa belonging to more than one taxonomic family, the most commonly represented family was chosen as the donor family for that event. The HGT contributions from the Bacillaceae group, which defined all HGT events within the sHGT set, were added by counting the number of unique sHGT events at T4.
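The majority rule for assigning a donor family can be sketched as below. The function name, the taxids and the lookup table are hypothetical stand-ins for the values obtained from the phylobase descendants function and the NCBI taxonomy files; how ties between families were resolved is not stated in the text, so this sketch simply returns the first family encountered among those tied.

```python
from collections import Counter


def donor_family(descendant_taxids, family_of):
    """Most commonly represented family among the descendants of a donor branch.

    descendant_taxids -- taxids of the extant taxa below the donor branch
    family_of         -- hypothetical lookup from taxid to family name
    Taxids without a family assignment are skipped (an assumption).
    """
    families = [family_of[t] for t in descendant_taxids if t in family_of]
    if not families:
        return None
    return Counter(families).most_common(1)[0][0]


# Hypothetical taxids and lookup, for illustration only.
toy_lookup = {
    101: "Paenibacillaceae",
    102: "Paenibacillaceae",
    103: "Planococcaceae",
}
print(donor_family([101, 102, 103], toy_lookup))
```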


3.3 Results

Where I do not explicitly differentiate in the text, HT or HGT-derived genes should be taken to mean genes predicted to derive exclusively from lHGT events (see Chapter 2, Section 2.3.7).

3.3.1 Function of HT and vertical genes

HT genes show a strong functional bias

To establish whether HT and vertical genes displayed different functional preferences, I assigned COG functional categories to all genes across the 25 GPA genomes (see Methods). For HT gene predictions across all transfer costs, the most populous COG function was ‘function unknown’ (COG S). The number of HGT-derived genes assigned to each COG class, along with the function of each COG, for HT genes predicted at transfer cost 4 (T4) are presented in Table 3.2.

Table 3.2: COG assignment at one transfer cost (T4)

| COG | COG Category Function | Genes | Proportion of HGT Genes |
|---|---|---|---|
| S | Function unknown | 2939 | 0.3120 |
| L | Replication, recombination and repair | 902 | 0.0957 |
| G | Carbohydrate transport and metabolism | 790 | 0.0838 |
| C | Energy production and conversion | 662 | 0.0702 |
| E | Amino acid transport and metabolism | 600 | 0.0636 |
| K | Transcription | 594 | 0.0630 |
| P | Inorganic ion transport and metabolism | 558 | 0.0592 |
| T | Signal transduction mechanisms | 513 | 0.0544 |
| M | Cell wall/membrane/envelope biogenesis | 354 | 0.0375 |
| Q | Secondary metabolites biosynthesis, transport and catabolism | 267 | 0.0283 |
| V | Defense mechanisms | 245 | 0.0260 |
| I | Lipid transport and metabolism | 223 | 0.0236 |
| O | Posttranslational modification, protein turnover, chaperones | 191 | 0.0203 |
| H | Coenzyme transport and metabolism | 183 | 0.0194 |
| J | Translation, ribosomal structure and biogenesis | 169 | 0.0179 |
| ∇ | No function assigned | 153 | 0.0162 |
| F* | Nucleotide transport and metabolism | 46 | 0.0049 |
| D* | Cell cycle control, cell division, chromosome partitioning | 25 | 0.0027 |
| N* | Cell Motility | 13 | 0.0014 |
| U* | Intracellular trafficking, secretion, and vesicular transport | 3 | 0.0003 |

Note: rows marked * (italicised in the original table; fewer than 50 genes) are excluded from enrichment analysis (see Section 3.2.1)


Next, it was necessary to define a baseline against which enrichment could be measured. For this purpose, I selected the set of genes predicted to be vertically-inherited. Here, however, instead of iterating over different prediction costs I selected the most conservative dataset, corresponding to T3 (see Chapter 2, Section 2.3.7). This was done for two reasons. First, a relatively high number of genes were included even at this most restrictive cost (N = 27375, corresponding to 1280 gene families - see Table 2.3). Second, my aim was to provide the most HGT-free baseline to enhance any potential differences in enrichment. As with HT genes, the most common functional category amongst vertical genes was COG S (35.2% of genes).

The enrichment profile for a single transfer cost (T4) is shown in Figure 3.1A (for calculation, see Methods). HT genes are most strongly enriched in the defence (COG V) function, followed by carbohydrate metabolism (COG G), secondary metabolite biosynthesis (COG Q), and no assigned function (COG ∇). In turn, the HT gene set is most strongly depleted in genes involved in translation (COG J) and coenzyme transport and metabolism (COG H). These results are in line with previous observations of functional bias in HGT-derived genes (see Chapter 1, Section 1.2.1). The ostensibly counter-intuitive enrichment for genes belonging to replication, recombination and repair (COG L), an informational gene category, is, in fact, expected. This COG category includes genes encoded by mobile elements such as transposases and integrases, and these are abundant in the predicted HT gene set.

Pattern of functional bias is stable across stringency of prediction

Next, I assessed whether this enrichment profile was consistent for HT gene predictions across different transfer costs. In all cases, the vertical gene set predicted at T3 (the most stringent level) was used as a baseline. Figure 3.1B shows that the clear HT gene functional enrichment profile remains remarkably consistent, regardless of the stringency of HGT prediction.


Figure 3.1: HT genes patterned by function. (A) COG functional categories enriched (green) and depleted (orange) for HT genes predicted at T4. Only COG classes with at least 50 genes are shown. A value of 0 would mean this COG class is equally represented in HT and vertical genes. (B) HT gene enrichment across COG classes and transfer costs. Boxes represent the range of enrichment values across costs. (C) HT gene functional enrichment extended to higher transfer costs, calculated on a different dataset (see text). Note: values are not log-scaled, values above 1 (green line) indicate enrichment for HT genes, values between 0 and 1 signify depletion for HT genes. Boxes as in (B). (D) COG enrichment for sHGT-derived genes only. Scales, colours, and boxes as in (B).

Although not calculated for the current dataset (see Chapter 2, Section 2.3.1), stability in this functional bias was observed across much more extreme transfer costs (Figure 3.1C). Despite the large fluctuations at the highest transfer costs, driven by a low number of genes, COGs V, G, L, and Q were still enriched in the HGT set, while COGs H and J remained strongly depleted.

Thus, even the least stringent HT gene set predicted at a transfer cost of 3, and so not filtered for constancy or consistency (Chapter 2, Section 2.3.7), represented a viable dataset for downstream analysis - at least in terms of the stability in functional enrichment observed here.

sHGTs show a weaker trend of functional enrichment

I then looked at the extent to which predicted sHGTs follow the same pattern of functional bias (Figure 3.1D). I found that the same general pattern was still present, albeit much less clearly defined. Although COGs V, G, and ∇ were still enriched, and COG J still depleted, for HGT, the rest of the categories did not follow the lHGT genes as clearly (compare to Figure 3.1B). For example, COG H was not strongly depleted in the sHGT set. Indeed, many more categories clustered close to 0 - no enrichment. Thus, although sHGT genes did show some familiar hallmarks of HGT functional bias, the pattern is more dilute. This could be due to the inclusion of mispredicted vertical gene families or a lessened connectivity barrier for genes coming from more closely related organisms.

3.3.2 Nucleotide content of HT and vertical genes

HT genes are more AT-rich than vertical genes

Next, I compared the GC content of HT and vertical genes. As above, I first made the comparison for genes predicted at a single transfer cost (T4); the vertical genes used in this analysis are again derived from the most stringent set. As a point of comparison, I additionally included the GC content of all other chromosomal genes that were not assigned to either the HGT or vertical sets - the “grey zone” as defined in Chapter 2 (Section 2.3.7). This grey zone included genes predicted to have been gained via sHGT. As can be seen from Figure 3.2A, HT genes have a GC content that is on average lower than the genome mean and the opposite is observed for the vertical genes. The grey zone genes, presumably a mix of some vertical genes and some HT genes that did not pass the detection filters, have a mean GC content between the two. The significant difference between the HT and vertical GC contents is clear, despite most HT genes (86.5%) being classified as derived from old HGT events (and so would likely be fully or partially ameliorated). This pattern is observed across all transfer costs (Figure 3.2B). The trend is significantly reinforced with increased detection stringency, though the magnitude of the change in the mean between individual costs is relatively small.

Figure 3.2: GC content of vertical and HT genes. (A) Distributions of GC contents of HT/vertical/grey zone genes. To allow comparison across genomes with different GC contents, GC contents are normalised to the mean GC content of all genes on a given chromosome (see Methods). *p<0.0001. (B) Comparison of HT gene GC contents predicted with different transfer costs. Values normalised as in (A). (C) Differences in HT and vertical gene GC contents across 25 GPA genomes, HT genes predicted at T4. Phylogeny as in Figure 2.6, red line represents the most parsimonious branch of GC content increase. ****p<0.0001; **p<0.01

HT genes are GC-poor regardless of genomic GC content

To see whether the observed pattern of decreased GC content in HT genes was uniformly observed across all species or driven specifically by a set of outliers, I compared HT and vertical gene nucleotide content by genome. Interestingly, the mean genomic GC content changed by about 10% within the GPA lineage (see marked branch in Figure 3.2C). This has resulted in two distinct groups within the GPA: Geobacillus with a mean GC content of ~52% and Anoxybacillus and Parageobacillus genomes with a mean GC content of ~42-45%. Although a single increase in genomic GC content is most parsimonious, I cannot exclude multiple decreases in GC content in the remainder of the GPA lineage as I have not estimated ancestral Bacillaceae GC content. Measured for each genome independently, it is clear that the mean GC content of HT genes is uniformly lower than that of vertical genes (Figure 3.2C). This is in line with observations in other prokaryotes (Chapter 1, Section 1.2.1).

Recent HT gene gains are most AT-rich

Driven by the expectation that recently transferred genes are likely to be the most AT-rich, I used my age-partitioned HT gene set to investigate whether this was the case for GPA genes. Some branches of the GPA phylogeny that are considered to be “old” in my analysis may in fact have harboured recent transfers as well (see Chapter 2, Section 2.3.8). However, I still include all genes predicted to derive from HGT events at these branches in this analysis. This potential contamination by recent genes in the old gene-set makes the comparison more conservative. For cost T4, I find that recently transferred HT genes pooled across all 25 GPA genomes are significantly more AT-rich than their more anciently-gained counterparts (Figure 3.3A).

I then wondered whether there is still a difference in the mean GC content of vertical genes and old HGT genes (in those genomes for which I have partitioned recent and old HGT sets). A per-genome comparison revealed that there was still a significant difference in GC content between genes deriving from old HGT events and the vertically-inherited genes (Figure 3.3B), though the gap appears reduced.

It is also interesting to note that although vertical genes are generally more GC-rich, in the majority of genomes that are compared in Figure 3.3B it is in fact an HT gene that has the highest GC content. This appears to also be the case when comparing recent and old HT genes. Thus, although there is an overall preference for GC-poor genes to be acquired, there is clearly some flexibility in the nucleotide content of incoming genes. It is possible that a less suitable nucleotide composition can be mitigated by an increased functional benefit provided by the protein product.

Figure 3.3: GC content of old and recent HT genes. (A) Comparison of old and recent HT gene GC contents, predicted at T4. GC content values are normalised, as in Figure 3.2. (B) Comparison of vertical and old HT gene GC contents in those genomes where recent and old HT genes were partitioned (light green branches, Figure 3.2C). The numbers above each boxplot represent the number of genes in that set. ****p<0.0001

3.3.3 Temporal dynamics of gene flow into GPA

To calculate the timing of gene gain I correlated the number of HGT events predicted to be gained along a given branch in the GPA phylogeny with the length of that branch. In the first instance, I wanted to see whether the number of HGT events corresponded to the branch lengths (these do not correspond to evolutionary time, see Chapter 2, Section 2.3.5) in the reference tree used for the reconciliation (Figure 3.4A). This was first done for predictions at a single transfer cost, T4. As can be seen from Figure 3.4B, there is no trend between the number of transfers gained at the branch and the length of that branch in the reference tree (r = 0.037, R2 = 0.0014; p = 0.804). This rules out branch length in the reference tree as a determinant of the number of HGT events predicted at a particular branch. This result remained consistent across all transfer costs.
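For a simple linear fit of HGT count against branch length, the coefficient of determination is the square of the Pearson correlation coefficient, which is how the r and R2 values quoted above relate to each other. The Python sketch below illustrates this with a hand-written Pearson function and toy data; it does not reproduce the Hmisc rcorr call used in the thesis and it does not compute p-values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy branch lengths and HGT counts, for illustration only.
toy_branch_lengths = [1.0, 2.0, 3.0, 4.0]
toy_hgt_counts = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(toy_branch_lengths, toy_hgt_counts)
print(r, r ** 2)

# The r value reported above for the reference tree squares to the reported R2.
print(round(0.037 ** 2, 4))
```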


Next, I made the same comparison but with branch lengths that corresponded to evolutionary distance. In lieu of a fully dated ultrametric tree, I used a standard phylogeny, built from a concatenated alignment of 1-to-1 orthologous genes (see Methods), where branch length is measured in substitutions per site (Figure 3.4C). I consider this to be a suitable proxy for evolutionary time given that the underlying data corresponds to highly conserved, likely vertically inherited genes that can be expected to evolve in a consistent manner over time. Figure 3.4D shows a clear, approximately linear relationship between branch length and number of transfers into that branch (r = 0.581, R2 = 0.34; p < 1E-04). Furthermore, this relationship was clearly recovered at all stringencies of prediction (Figure 3.4E, column 2).

Interestingly, two long branches (daggers in Figure 3.4C, D) appeared to be outliers with fewer transfer events than might be expected given the otherwise strong linear relationship. The first of these, corresponding to the branch separating Geobacillus species from the Parageobacillus and Anoxybacillus clades, separates two parts of the GPA phylogeny with markedly different GC contents (see Figure 3.2). Such a significant shift in global GC content may result in the “artificial” elongation of the branch, i.e. it would be a poorer proxy of evolutionary time as more substitutions would be expected. The second outlier branch leads to the first of the two Anoxybacillus subclades. In this case, I found that the three genomes in that clade all had markedly smaller genome sizes than the rest of the GPA clade (mean 2.82Mb versus 3.60Mb). This suggests a period marked by net genetic loss, and so may lead to a lower number of detected HGT events in the extant descendants. Unsurprisingly, excluding the two outlier points from the correlation improves the linear relationship (Figure 3.4E, column 3).
Next, I looked at whether HGT events into the recent branches showed a different pattern. Excluding the old branches and zooming in on just the recent events (Figure 3.4F) did not suggest any correlation between branch length (even if very short) and the number of transfers, and this was borne out by a very weakly negative, non-significant linear correlation (r = -0.0187, R2 = 0.000348; p = 0.934). Thus, although evolutionarily recent branches appear to have experienced mixed amounts of gene influx, there is no clear relationship between the length of the branch and the amount of gene flow. This could be either because purifying selection has yet to affect these genes or because there are relatively few HGT events per branch and their relationship is obscured by noise.

Figure 3.4: Timing of HGT across GPA lineage evolution. (A) GPA phylogeny as used in the reconciliation, branches coloured by age as in Figure 2.6. (B) Plot of (A) branch length against the number of HGTs gained at that branch. (C) Phylogeny with branch length corresponding to evolutionary distance, as in Figure 2.6. (D) Plot of (C) branch length against the number of HGTs gained at that branch. † and ‡ highlight two outlier points, branches are labelled in (C). (E) Table of correlation values of branch length versus HGTs gained across different transfer costs, with and without the outlier points labelled in (D). (F) Plot of branch length against number of recent HGTs only. N refers to the total number of recent HGT events included.

3.3.4 Sources of HGT into GPA

To investigate the major contributors of HT genes to GPA lineage evolution, I considered the HGT donors predicted during reconciliation. Although excluded from the main set of HT genes, as they are sHGTs, the number of transfer events from the Bacillaceae family (Figure 3.5) outweighs all other donors by a large margin. The main donor contributing to the lHGT-based set of HT genes is the family Paenibacillaceae, predicted to be the source of over 400 HGTs into GPA (Figure 3.5). This is twice as many as the next biggest donor, Planococcaceae. In fact, six of the ten most common donor families are in the order Bacillales (Paenibacillaceae, Planococcaceae, Alicyclobacillaceae, Thermoactinomycetaceae, Listeriaceae, Staphylococcaceae) to which the GPA clade also belongs. Despite the apparently close phylogenetic proximity between GPA and these donors, they are in fact separated by a long period of evolutionary time. A recent estimate dates the common ancestor of Paenibacillaceae and GPA to 1.73 billion years ago (Marin et al. 2017).

In contrast, there are signs of shared ecology between the most common donors and GPA. For example, Paenibacillaceae includes facultatively anaerobic species which are frequently isolated from soil and compost habitats (Mayilraj and Stackebrandt 2013), environments they share with GPA taxa. This is further supported by anecdotal evidence of Paenibacillus and Geobacillus being co-sampled, including an instance where a Paenibacillus isolate was misclassified as Geobacillus due to shared isolation conditions (Mead et al. 2012). Furthermore, notable non-Bacillales donors, Clostridia and Thermoanaerobacteraceae, are both anaerobic, and the latter is also thermophilic; these traits again suggest possible shared ecology. Finally, the only common non-Firmicute donor is ( phylum). Its preferred habitats are soil and vegetation, and at least in the former it is likely to encounter members of the GPA clade. Aside from the 10 largest donor families, lHGTs into GPA were predicted to originate from 198 other taxonomic families, accounting for 910 transfer events. Taken together, these observations tentatively suggest a role for shared ecology in shaping gene flow into GPA.

Figure 3.5: Common HGT donors into GPA. Barplot showing the most common HGT donor groups (at the family taxonomic level). Dashed fill represents sHGT donors; all HGT events from non-GPA Bacillaceae donors are sHGT by definition. Solid bars represent lHGT events. The “All others” group comprises 198 taxonomic families.


3.4 Discussion

In this chapter I have evaluated my HT and vertical gene predictions in the context of known functional and nucleotide biases that have been repeatedly associated with HGTs in the literature. I found a clear pattern in the enrichment and depletion of HT genes for certain COG classes when compared to the baseline vertical gene set. As expected based on observations in previous works, genes involved in defence and carbohydrate metabolism were highly enriched in the HT gene set, whilst those involved in translation were strongly depleted. Importantly, this pattern remained highly consistent across all the HT gene sets that were predicted at varying stringency levels, including the least stringent dataset that bypassed many late-stage filters (T3). This observation was promising, testifying to the quality of the reconciliation calculations. In fact, consistency in the functional enrichment pattern continued at ‘extreme’ transfer costs levied in reconciling a previous dataset.

Whilst the pattern and its robustness were encouraging, I was surprised that other metabolic categories, such as energy production and conversion (COG C) and amino acid transport and metabolism (COG E), did not show the same levels of enrichment. This possibly highlights a pitfall in the COG method of classifying enrichment. Although having just 20 categories means enough genes can be binned into each to show significant differences between datasets, this necessitates some forceful “boxing together” of genes. The menagerie of functions within a single COG class is unlikely to share the same biological propensity for HGT, perhaps resulting in unexpectedly muted enrichment profiles. An alternative approach might involve pooling genes that share more refined functions across the classically defined COG classes, such as transport proteins or proteins of multi-protein complexes, and performing the same enrichment calculation. However, as the focus of this analysis was to see whether HT and vertical genes differed functionally in an expected manner, and this was achieved, alternative methods were not explored further.


The set of HT genes going into this chapter was conservatively selected to consist exclusively of lHGT-derived genes. This was done due to the concern that the sHGT set might have a higher proportion of false positive hits due to systematically poor resolution of closely related taxa in the gene trees. Analysis of the COG enrichment profile for the sHGT set showed a pattern weakly similar to that of the lHGT genes, which became more similar with increasing stringency. It is not possible to determine from my analysis whether this diluted pattern is due to the inclusion of truly vertical genes in the sHGT set (false positives) or whether transfer between more closely related taxa is more permissive for certain gene types due to fewer intrinsic barriers. Nevertheless, in an attempt to retain a highly stringent HGT dataset, I continued to exclude sHGTs from the HT gene set.

To further evaluate the quality of the HT gene set, and its robustness to prediction stringency, I evaluated patterns in nucleotide content - specifically the GC-richness of the genes. My observation that HT genes were generally more AT-rich than the genomic average (even for genomes with divergent GC contents), and much more so than vertical genes, once again matched my expectations based on previous works. Furthermore, I found that the most recently transferred genes were significantly lower in GC content than those derived from older HGTs. The strength of this observation supported the validity of my ad hoc approach for segregating old and recent HT genes, outlined in Chapter 2 (Section 2.3.8).

As with the functional analysis, the robustness of the above observations with respect to increasing stringency of prediction did not reveal a single most suitable transfer cost. Whilst I did observe a general decrease in GC content in HT genes with increasing transfer cost, this could be explained by an inherent bias in the reconciliation calculation.
At high transfer costs, predicted HGTs into more ancient nodes of the GPA lineage are more likely to be parsimoniously explained by a combination of duplications and losses, while recent transfers into shallower nodes (subtending one or two taxa) are more resistant to this effect. This “artificial” selection for more recent transfer events may skew the nucleotide content of the entire HT gene set towards AT-richness, as the average level of amelioration of the genes declines.

Given the robustness in the functional and nucleotide patterns observed, how can a suitable transfer cost (and so prediction stringency) be chosen? As mentioned previously, this process of qualitative evaluation was an exercise in balance: selecting a qualitatively robust dataset that maximised the number of HT genes included. To that end, I chose to continue with the HT gene set predicted at a transfer cost of 4 (T4). The reason for not continuing with the T3 dataset, and reaping the extra ~3000 genes (see Table 2.3), was that despite qualitative consistency it was not subjected to the post-reconciliation filters (see Chapter 2, Section 2.3.7) and so would suffer from the biggest error due to the heuristic nature of the reconciliation algorithm.

In the next part of this chapter I interrogated the temporal dynamics of HGT into the GPA lineage. I found a correlation between branch length (a proxy for evolutionary time) and the number of transfers into that branch. I ensured that this result was not due to an inherent bias introduced during the reconciliation calculation (that attempts to be time-consistent). Further, this observation was highly consistent for HT gene sets predicted at all stringency thresholds. Biologically, this strongly implicates a model of continuous, rather than punctuated, gene gain for GPA lineage evolution. Evaluating just recent transfers suggests a stochastic influx of genes that settles into the linear relationship observed in the older transfers by the action of long-term purifying selection - known to be a dominant force in bacterial evolution.
Although I find no evidence of punctuated gene acquisition in GPA lineage history, it cannot be ruled out that bursts of gene gain occurred and that their signal was subsequently smoothed and masked by overwhelming purifying selection. Although my observations seem conclusive with respect to the GPA clade, there is no guarantee that this mode of HGT gain is dominant - even within other closely related groups. Further work of this type in other, divergent clades would be interesting to determine whether the continuous model of gene gain more broadly describes microbial evolution.

Finally, I briefly looked at the identities of the donors of predicted HGT events into GPA. However, these results come with the caveat that a suboptimal reference tree outside of the GPA local neighbourhood may have skewed the observed donor predictions. I found that transfers from the Bacillaceae comprise a large proportion of all predicted HGT events. This is not surprising, since HGT between phylogenetically close species is known to be common because of higher shared similarity in gene composition, expression programs, and ability to integrate safely by homologous recombination. Furthermore, the Bacillaceae group does include taxa with shared growth conditions and environment, like the Bacillus smithii group (Bosma et al. 2016), which would provide valuable gene exchange partners - sharing ecology and close relation. Thus, transfers coming into any Bacillaceae organism would likely be exchanged amongst all other Bacillaceae taxa sharing a similar lifestyle over evolutionary time, resulting in these gene families being relegated to the sHGT set in my analysis if they first arrived outside of GPA. Further, the likelihood of false positive HGT prediction is increased for transfers within the Bacillaceae, due to poor phylogenies of truly vertically-inherited genes, and hence sHGTs are excluded from the main HT gene dataset - however, this may also increase the number of Bacillaceae donors predicted.

Outside of the Bacillaceae, six of the ten most common contributors to GPA genetic diversity were within the same taxonomic order as the GPA, the Bacillales.
However, given the large divergence time between GPA and potential Bacillales donors, I consider ecology to be the major determinant of these highways of gene exchange. Strikingly, the top eight taxonomic families are known for their either anaerobic or thermophilic (often both) lifestyles, and in many cases species are known to share the same environment as GPA. Given this observation, it seems likely that ecology is the driving factor in defining HGT donor-recipient relationships with respect to GPA - especially since GPA ecology is often restrictive with respect to oxygen or temperature, limiting potential gene gain from ecologically divergent taxa.

Chapter 4

Topology of HT genes in GPA genomes


4.1 Introduction

In this chapter I detect and analyse the genomic/spatial distribution of the HT and vertical gene sets that I derived by my HGT inference approach (Chapter 2) and evaluated against prior observations in the context of functional and nucleotide characteristics (Chapter 3). The importance of bacterial genome organisation is discussed in detail in Chapter 1 (Section 1.2.4), and some spatial patterns are discussed with respect to HGT. For example, HT genes appear to cluster at the terminus (Touchon, Bobay, and Rocha 2014; Zarei, Sclavi, and Cosentino Lagomarsino 2013) and prophage-free hotspots of HGT do not linearly decrease in frequency from origin to terminus (Oliveira et al. 2017). My aim for this chapter was to investigate the extent to which HGTs conform to and possibly inform genome architecture in GPA organisms.


4.2 Methods

4.2.1 Genome coordinate normalisation

To consider the spatial distribution of genes along the chromosomes across different genomes in a cumulative fashion, the genomes need to be aligned, if not at the nucleotide level, then at least with regard to pertinent landmarks. To see to what extent the GPA genome assemblies were already consistent with respect to gene location, I looked at the position of the essential replication initiator gene, dnaA. I found, for example, that the dnaA gene in the G. kaustophilus assembly lies at coordinates 88-1440, while in Parageobacillus thermoglucosidasius the same gene lies at coordinates 613153-614505 on the complementary strand. This showed that the assemblies were centred on different parts of the genome.

A key landmark in the bacterial genome landscape is the origin of replication which, given its importance in adaptive genome organisation (see Chapter 1), was a good candidate around which to rearrange the GPA genomes. Since previously calculated origin locations were not available for all 25 GPA genomes used in this analysis, I opted to use the dnaA gene instead, since the gene is typically found adjacent to the origin. To evaluate whether the location of the dnaA gene served as a good proxy for the origin, I compared the coordinates of the start of the dnaA gene to the start of the predicted origin of replication for 1782 genomes from the DoriC v6.5 database (Gao, Luo, and Zhang 2013). I found that in 88.6% of the genomes, the origin and dnaA gene lay within 1% of the genome length of each other. This increased to 92% when the threshold was relaxed to 5% of the genome length. Furthermore, in the 10 GPA and 45 Bacillus spp. genomes present in the DoriC dataset, the predicted origin of replication lay immediately adjacent to the dnaA gene. These results follow previous work showing that the origin lies on either side of the dnaA gene in B. subtilis, and that this arrangement is conserved in other firmicutes such as Streptococcus pyogenes and the more distantly related Mycoplasma capricolum (Briggs, Smits, and Soultanas 2012). These observations seemed sufficiently robust to use the dnaA gene as a proxy for the replication origin in my analysis.

Standardised coordinates were calculated by finding the distance between the first base of the dnaA gene and the first base of the target gene. Gene orientation was similarly recalculated relative to the dnaA gene. Thus, post standardisation, the dnaA gene lay at coordinates 0 to ~1400 on the leading strand for each GPA genome. Since the GPA genomes vary in size from just 2.8 Mbp to almost 3.9 Mbp, I normalised the gene coordinates to range from 0 (the start of the dnaA gene) to 1 (immediately upstream of the dnaA gene) to facilitate cross-genome comparison. Of course, this approach sacrifices the absolute spatial distances between the genes in individual genomes.
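The standardisation and normalisation described above can be illustrated with a minimal Python sketch. It is not the code used in the thesis: the function name, its arguments and the convention of passing each gene's first (5') base and a +1/-1 strand are assumptions made for illustration, and the chromosome is treated as a simple circle.

```python
def standardise(gene_first_base, gene_strand, dnaa_first_base, dnaa_strand, genome_len):
    """Return (normalised position in [0, 1), strand relative to dnaA).

    Hypothetical helper: strands are +1 (forward) or -1 (complementary) and
    positions are the first (5') base of each gene in assembly coordinates.
    """
    # Measure distance in the direction dnaA is read; the modulo wraps
    # genes lying "behind" dnaA around the circular chromosome.
    offset = (dnaa_strand * (gene_first_base - dnaa_first_base)) % genome_len
    # A gene on the same strand as dnaA gets +1, on the opposite strand -1.
    return offset / genome_len, gene_strand * dnaa_strand
```

For a dnaA gene on the complementary strand (as in the P. thermoglucosidasius example), the first base is the higher coordinate, so the gene maps to position 0 and its last base to roughly 1350 bases downstream, consistent with dnaA lying at 0 to ~1400 after standardisation.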

4.2.2 Gene distribution and zone boundaries

To calculate and visualise the distribution of genes across the circular chromosome (as in Figure 4.1A), I converted the normalised gene start positions (see above) for all relevant gene sets to circular coordinates, followed by density estimation with the vonmises kernel and a bandwidth of 3000. These steps were performed using the circular and density.circular functions, respectively, from the circular R package (Agostinelli and Lund 2017). Density plots were produced using a custom wrapper function incorporating the plot.density.circular function from the circular package and the polyclip function from the polyclip package (Johnson and Baddeley 2017), which was used to colour the regions of HT gene enrichment and depletion.

The linear representation of gene densities across the GPA chromosome (as in Figure 4.2C) was calculated in a similar fashion, using the stat_density function (parameters: n = 2^12, adjust = 1/10) within a ggplot function from the ggplot2 package. To avoid boundary effects of linear density plotting, the data range for genes was expanded by 10% on either end of the plot and the extensions were then masked in the plotting functions using the coord_cartesian function in ggplot2.
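To make the circular density step concrete, the following is a rough Python sketch of a von Mises kernel density estimate. It is an illustration only and not a port of density.circular: it assumes that the bandwidth of 3000 acts as the kernel concentration parameter (kappa), and it normalises the curve numerically over the grid rather than with the analytical Bessel-function constant. The function name and grid size are hypothetical.

```python
import math

def circular_density(positions, kappa=3000.0, n_grid=512):
    """Von Mises kernel density of normalised positions (0-1) on a circle.

    Returns (grid_angles, density); the density integrates to ~1 over
    [0, 2*pi). Requires at least one position.
    """
    angles = [2 * math.pi * p for p in positions]
    step = 2 * math.pi / n_grid
    grid = [i * step for i in range(n_grid)]
    # exp(kappa * (cos(d) - 1)) is the von Mises kernel up to a constant;
    # subtracting 1 keeps the exponent <= 0 so it cannot overflow.
    raw = [sum(math.exp(kappa * (math.cos(g - a) - 1.0)) for a in angles)
           for g in grid]
    total = sum(raw) * step  # numerical normalisation (assumption, see lead-in)
    return grid, [r / total for r in raw]
```

Because distances are measured through the cosine of the angular difference, genes just before and just after dnaA contribute to the same part of the curve, which is the property that motivates a circular rather than a linear estimator here.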


Boundaries separating zones were estimated by eye, both to separate regions with differential HT/vertical gene enrichment and to be symmetrical with respect to their distance to the origin. Along the normalised chromosome, ranging from 0 (first base of dnaA) to 1 (immediately upstream of dnaA), the zone boundaries were placed at the following coordinates: 0.045, 0.130, 0.230, 0.375, 0.625, 0.770, 0.870, 0.955 (see Figures 4.2C and 4.3E). The zones were labelled and then coloured based on the dominant gene type within the zone.
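As an illustration of how these boundaries partition the chromosome, the sketch below assigns a normalised position to a zone. The boundary values are those listed above; the function name and the numeric labels (0 for the segment containing the origin through to 4 for the segment spanning the terminus) are hypothetical and stand in for the zone names used in the figures.

```python
from bisect import bisect_right

# Zone boundaries as listed in the text (normalised coordinates).
BOUNDARIES = [0.045, 0.130, 0.230, 0.375, 0.625, 0.770, 0.870, 0.955]

def zone_index(pos):
    """Return 0 (origin-most zone) to 4 (terminus zone) for a position in 0-1.

    The boundaries are mirror-symmetric about the origin-terminus axis, so
    the segment index is folded to give both replichores the same labels.
    """
    segment = bisect_right(BOUNDARIES, pos % 1.0)  # 0..8 along the chromosome
    return min(segment, len(BOUNDARIES) - segment)
```

Folding the nine linear segments into five labels relies only on the stated symmetry of the boundaries (for example, 0.045 and 0.955 are equidistant from the origin).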

4.2.3 Gene link analysis

To visualise major genome rearrangements in GPA history, all gene families comprising 1-to-1 orthologues were identified from the vertical gene set. For each such gene family (N = 1328), a single gene was selected as a “key gene”. The distances between the key gene and all its orthologues, so-called gene links, were then calculated. Drawing links between a key gene and its orthologues significantly simplified the graph, compared to drawing a link between each pair of orthologues, while still showing the overall distribution of the gene families. To focus on large scale rearrangements, links were filtered to include only those that spanned at least 5% of the normalised genome size. Calculated links were converted into a suitable format to be plotted using the RCircos R package (Zhang, Meltzer, and Davis 2013). Plotting was done using a custom wrapper script utilising a modified RCircos.Link.Plot function (added ability to adjust the width of the linking lines). Link lines, as in Figure 4.2B, were recoloured as a gradient in Adobe Illustrator CS6 to better illustrate the patterns of genome rearrangement.
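The link calculation can be sketched as follows. This is a hedged illustration rather than the thesis implementation: how the key gene was chosen is not stated, so the sketch simply takes the first genome in sorted order, and it assumes that the span of a link is the shorter arc between two positions on the circular chromosome. The function and argument names are hypothetical.

```python
def gene_links(family_positions, min_span=0.05):
    """Links from a key gene to each orthologue spanning >= min_span.

    family_positions maps genome name -> normalised position (0-1) of that
    genome's member of a 1-to-1 orthologue family.
    """
    genomes = sorted(family_positions)
    key = genomes[0]  # arbitrary but reproducible key gene (assumption)
    key_pos = family_positions[key]
    links = []
    for genome in genomes[1:]:
        diff = abs(family_positions[genome] - key_pos)
        span = min(diff, 1.0 - diff)  # shorter arc on the circle (assumption)
        if span >= min_span:
            links.append((key, genome, span))
    return links
```

Linking every orthologue only to the key gene yields at most one link per non-key genome for each family, instead of one per pair of genomes, which is the simplification described above.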

4.2.4 Genomic islands and tRNA

GI predictions, for all available genomes and by all methods, were downloaded from the IslandViewer4 resource (Bertelli et al. 2017) in CSV format on 15 May 2018. The data was filtered to include only GPA GI predictions, available for 17 of the 23 non-rearranged GPA genomes. GI start and end positions were standardised for each genome, as done previously for genes (see Section 4.2.1). A gene was considered to lie within a GI if the start coordinate of the gene was within the span of any individual island predicted for that genome. Similarly, to estimate what proportion of GI genes lay within each zone (e.g. Near Origin, Flank), I counted the number of GI gene start coordinates within each zone.

To map the locations of tRNA genes, I processed each original GPA GenBank-formatted file to extract the nucleotide sequences of tRNA genes in FASTA format. This was done using the genbank_to_fasta.py script in the same way as for protein sequences (see Chapter 2, Section 2.2.1), except selecting for tRNA features instead of CDS. The coordinates of the tRNA genes were standardised for each genome, as above.

4.2.5 Gene orientation and nucleotide content

Each gene lay either on the leading strand (same strand as dnaA) or lagging strand (opposite strand to dnaA). The preference of genes to lie on either strand was calculated for each of 200 equally sized, non-overlapping bins (thus, each bin = 0.5% of the chromosome) across each GPA genome for each appropriate set of genes (e.g. vertical, old HT genes). The calculation was as follows for each bin, where |x| denotes the number of genes on strand x:

$$\text{Strand Preference} = \frac{|\text{Lead}| - |\text{Lag}|}{\sum\left(|\text{Lead}| + |\text{Lag}|\right)}$$

The calculated values were then plotted as a Loess smoothed curve across the combined GPA chromosome using geom_smooth within a ggplot function with the parameter span = 0.1 (Figure 4.4A). In this plot, for each bin there are up to 23 underlying values informing the curve, one for the strand preference of each underlying GPA genome. Boundary effects of fitting a curve on a linear plot were avoided as above (see Section 4.2.2).

Nucleotide content across the chromosome was calculated in a similar way to strand preference, for each GPA genome across 200 non-overlapping bins. Here, I used the genome-normalised relative GC content to be able to compare nucleotide bias across genomes with disparate mean GC contents (see Section 3.2.2). For each set, e.g. HT genes, I calculated the mean relative GC content of all relevant genes within the bin for each genome. The resulting values were plotted as for strand preference, above.

To isolate the effect of local nucleotide content on the nucleotide composition of HT genes, I used the per-bin GC content calculated above. For each bin for each genome, I took the mean GC content of included HT genes (either old or recent transfers) and the mean GC content of included vertical genes. I only included bins with at least three vertical genes to improve the robustness of this measure of established local nucleotide content. Bins that lacked either HT genes or sufficient vertical genes were excluded. Each suitable bin for each genome could then be plotted with respect to its vertical and HT gene GC content (as in Figure 4.4C). The Pearson correlation of HT gene GC content to vertical gene GC content was calculated using the rcorr function from the Hmisc R package (Harrell Jr, Charles Dupont, and others 2018).
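The per-bin strand preference defined in this section can be sketched in Python as below. This is a minimal illustration for a single genome and a single gene set, not the original implementation: the function name and input format are hypothetical, and returning None for bins without genes is an assumption about how empty bins might be handled.

```python
def strand_preference(genes, n_bins=200):
    """Per-bin (|Lead| - |Lag|) / (|Lead| + |Lag|) for one genome.

    genes is an iterable of (normalised_position, strand), with strand +1
    for the leading strand (same strand as dnaA) and -1 for the lagging
    strand. Bins that contain no genes are returned as None.
    """
    lead = [0] * n_bins
    lag = [0] * n_bins
    for pos, strand in genes:
        b = min(int(pos * n_bins), n_bins - 1)  # guard against pos == 1.0
        if strand == 1:
            lead[b] += 1
        else:
            lag[b] += 1
    return [
        (n_lead - n_lag) / (n_lead + n_lag) if (n_lead + n_lag) else None
        for n_lead, n_lag in zip(lead, lag)
    ]
```

Values range from -1 (all genes in the bin on the lagging strand) to +1 (all on the leading strand); repeating the calculation per genome gives the set of per-bin values that inform the smoothed curve.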

4.2.6 Functional enrichment by zone

To calculate functional bias by genome zone, each gene within each set (e.g. HT genes) was assigned to a zone based on its start coordinates. To facilitate comparison, genes falling into the same zone on either replichore were pooled together so that the final distribution of genes ranged from origin to terminus (as in Figure 4.5A, top panel). The distribution of genes belonging to each COG functional category comprising at least 50 genes was then plotted across the genome. To visualise enrichment (Figure 4.5A), the density was scaled individually for each COG class and plotted using the stat_density function within a ggplot function with the following parameters: aes(fill = 10..scaled..), geom = “tile”, n = 2000, adjust = 1/4. The scaling factor used was chosen specifically to highlight regions along the chromosome with the highest gene densities without the noise of background colour. COG categories were grouped (Figure 4.5A, dendrogram) by applying hierarchical clustering to the proportion of HT genes in the respective COG found in each zone. This was done using the base R functions dist and hclust with the ward.D2 method. The dendrogram was plotted with the ggdendrogram function from the ggdendro R package (de Vries and Ripley 2016).

4.2.7 Gene expression data

Microarray data from B. subtilis strain 168 grown in different media (Borkowski et al. 2016) were downloaded from NCBI Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78108). The expression data was matched to B. subtilis gene position data for the same assembly, and gene coordinates were standardised based on the location of the dnaA gene and genome size, as above. For visualisation, transcript levels for each growth medium were plotted using geom_smooth within a ggplot function with the following parameters: span = 0.2, method = “loess”. Boundary artefacts were avoided in the same way as above (see Section 4.2.2).

4.2.8 Orthologue positions in non-GPA genomes

For a given COG category, I first selected all gene families that belonged to that COG (including all genes classified as HT, vertical, and neither). All the genes were partitioned based on whether they fell within the first 38% or the remaining 62% of the distance between the origin and terminus (see Section 4.3.6), and I selected the gene families that were represented in each partition. Some gene families had genes across GPA genomes that fell within both the 38% and 62% partitions, and so the positions of their orthologues were conservatively represented in both downstream plots. For each gene family, I identified orthologues in both spore-forming and non-spore-forming species.

All genomes used in this study that belonged to Alicyclobacillaceae, Bacillaceae, Paenibacillaceae, Sporolactobacillaceae and Thermoactinomycetaceae were considered as spore-forming. The non-spore-forming set comprised all genomes belonging to Listeriaceae, Staphylococcaceae, and the order Lactobacillales. However, I excluded Bacillaceae, Sporolactobacillaceae and Thermoactinomycetaceae from further analysis so that the non-spore-forming set of genomes ended up being more closely related to the GPA clade than the spore-forming set. The aim was to control for phylogenetic drag, i.e. the likelihood of finding GPA-like patterns in a set of genomes only because those genomes are more closely related. By ensuring that the non-spore-formers are more closely related, I preemptively stack the deck against this possibility.

Orthologue positions in spore- and non-spore-forming genomes were standardised and normalised as for GPA genes (see Section 4.2.1). In addition, where orthologues were identified in multiple genomes that belonged to the same taxonomic species (e.g. two Streptococcus pyogenes strains), orthologue positions were averaged across all underlying genomes to eliminate bias from highly sequenced species. Orthologue positions were plotted using stat_density within a ggplot function with the following parameters: geom = “line”, adjust = 1/10.
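The partitioning and per-species averaging described in this section could look roughly like the following Python sketch. Several details are assumptions made for illustration: the terminus is taken to lie at normalised position 0.5, positions are folded onto the origin-terminus distance before averaging, and the function names and partition labels are hypothetical.

```python
from collections import defaultdict

def origin_terminus_fraction(pos):
    """Fold a normalised position (0-1) onto 0 (origin) .. 1 (terminus)."""
    pos = pos % 1.0
    return min(pos, 1.0 - pos) / 0.5  # assumes the terminus sits at 0.5

def partition(pos, cutoff=0.38):
    """'proximal' if within the first 38% of the origin-terminus distance."""
    return "proximal" if origin_terminus_fraction(pos) < cutoff else "distal"

def average_by_species(records):
    """Average folded positions per species.

    records is an iterable of (species, origin_terminus_fraction) pairs, one
    per genome in which an orthologue was found.
    """
    grouped = defaultdict(list)
    for species, frac in records:
        grouped[species].append(frac)
    return {sp: sum(vals) / len(vals) for sp, vals in grouped.items()}
```

Averaging within a species means that a species represented by many sequenced strains contributes a single position per gene family, in line with the stated aim of removing bias from highly sequenced species.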


4.3 Results

Where I do not explicitly differentiate in the text, HT or HGT-derived genes should be taken to mean genes predicted to derive exclusively from lHGT events predicted at a transfer cost of 4 (T4; see Chapter 3, Section 3.4). Similarly, vertical or vertically-inherited genes should be taken to mean genes predicted to have an exclusively vertical evolutionary history within the GPA, predicted at the highest stringency (see Chapter 3, Section 3.3.1).

4.3.1 Gene history spatially patterns GPA genomes

To establish whether HT and vertical genes were differentially distributed across the genome, I determined the locations of the genes in each category across the GPA chromosomes. To standardise gene positions across all genome assemblies, I remapped the gene coordinates in each genome as a measure of distance relative to the replication initiator dnaA gene, which was a good proxy for the origin of replication, and normalised all coordinates for genome length (see Methods). I plotted the density of HT and vertical gene locations from across all 25 GPA genomes onto a normalised bacterial chromosome. Regions of the genome that had a higher density of HT genes than the overall density of all genes (HT + vertical + grey zone genes) were considered to be enriched in HGT; conversely, regions where HT gene density was less than the all-gene density were marked as HGT-depleted.

Here, I observed a striking pattern - HGT enrichment was almost entirely constrained to three compartments in the combined GPA chromosome (Figure 4.1A): one broad region spanning the terminus, and two smaller regions on either side of, but not at, the origin of replication. HT genes were strongly depleted at the immediate origin and also in two broad flanks that are distal from the origin, one on each replichore. Although the sets of HT and vertical genes make up less than 50% of all GPA chromosomal genes (Figure 4.1B, N = 37166), over the majority of the genome those regions showing HGT enrichment were concomitantly depleted in vertical genes, and vice versa. This pattern, as expected from the consistency in other parameters seen in Chapter 3, was almost identical for HT genes predicted at different stringencies (Figure 4.1C). This presented the first suggestion that the distribution of genes along the GPA chromosome is patterned by the evolutionary history of the genes.

Calculating gene density distributions based on locations of individual genes (and not gene families / HGT events, see Section 4.2.2) could lead to a bias where the oldest HGT events, having the most underlying genes, could end up driving the overall density signal. To assess whether this was the case or if the HGT-rich zones were indeed more permissive to new gene arrival and maintenance, I looked at the distribution of genes derived from recent HGTs. I found that recently acquired genes were similarly strongly patterned to the same genomic regions - and similarly excluded from the immediate origin of replication and flank regions enriched in vertical genes (Figure 4.1D). This suggests that there is a strong initial selection on the location of integration of HT genes. This strong barrier is more consistent with active exclusion of genes from certain regions because they disrupt the local expression architecture, though it could also stem from there being more opportunities to integrate in the HGT-rich zones due to increased presence of elements that facilitate transfer, such as transposases and integrases. If disruption is indeed the cause, the selection against integration in vertical-rich zones should be relaxed in transfers from more closely related genomes, where integration by homologous recombination may reduce the chance of disruption.
Indeed, I find that these sHGT-derived genes (excluded from the main set of HT genes; see Chapter 3, Section 3.4) are more promiscuous in their chromosomal location (Figure 4.1E). Although sHGT genes are enriched around the terminus, their depletion in the flanks is highly reduced and there is also a surprisingly distinct peak immediately downstream of the origin. However, the sHGT results should be treated with caution since there is a higher likelihood of the gene set containing truly vertical genes (false positives).

In summary, genes derived from HGT are clearly spatially patterned with respect to the GPA chromosome. Given that recently transferred genes are rapidly eliminated from regions that are enriched for vertical genes, the genome seems to be divided into zones that are restrictive and permissive to HGT. Clustering of HT genes around the terminus is in line with the HGT-rich region identified previously in E. coli (Lawrence and Ochman 1998; Rocha 2004; Zarei, Sclavi, and Cosentino Lagomarsino 2013). However, what is the evolutionary driver for having spatially distinct HGT-rich zones, closer and more distal to the origin of replication, and what is the source of the mirror symmetry around the origin-terminus axis?

Figure 4.1: Topology of HGT into GPA clade. (A) Density profiles illustrating the relative enrichment/depletion of HT (green/red) and vertical (yellow) genes across length-normalised GPA genomes. Overall gene density is given by the black line. HT genes predicted at T4. (B) Absolute number (and proportion) of genes classified as HGT (old and recent), vertical or belonging to the grey zone. (C) HT gene density enrichment profiles, as in (A), across other transfer costs. (D) Recent HT gene density enrichment profile (T4). (E) sHGT gene density enrichment profile (T4).

4.3.2 Inversions around origin-terminus axis drive symmetry

Next, I considered the source of the origin-terminus symmetry in the observed pattern. To ensure that the symmetry was not caused by a batch effect of combining the gene locations across multiple GPA genomes that might have HGT enrichment on one or the other replichore, I plotted gene locations for each individual chromosome. Although the decreased number of HT genes available for each plot (N ~300-500) resulted in a ‘spikier’ appearance, the same trend was observed in almost all single GPA genomes (Sup. Fig. 1). Notably, however, two genomes appeared to have experienced significant rearrangements. Although centred on the dnaA gene, Geobacillus sp. Y412MC61 and Anoxybacillus amylolyticus appeared to have undergone recent asymmetric inversions that shifted the entire terminus region and generated highly imbalanced replichore lengths. The shift of the terminus region, at least for G. sp. Y412MC61, was supported by the non-polar-opposite location of the dif site predicted in the DoriC database (Gao, Luo, and Zhang 2013). Although the pan-genome pattern remained highly consistent regardless of the inclusion of the two rearranged genomes (Figure 4.2A), I conservatively removed these two genomes from further spatial analysis. A previous study of HGT chromosomal distribution similarly excluded genomes in which the ratio of the predicted replichore lengths exceeded 1.2 (Oliveira et al. 2017).

The other possible source of symmetry is the propensity of GPA species to undergo rearrangements around the origin-terminus axis (see Chapter 1, Section 1.2.4). To investigate this, I looked at the positions of genes belonging to 1328 families comprising 1-to-1 orthologues in the 23 GPA genomes. To capture only large-scale rearrangements, a link was drawn between orthologues if the distance between them was at least 5% of the normalised genome size (see Methods, Section 4.2.3). As can be seen from Figure 4.2B, the vast majority of links appear symmetrical around the origin-terminus axis, suggesting that symmetric inversions are indeed the dominant form of rearrangement in GPA species and explaining the symmetrical pattern of HT and vertical genes. Critically, a region nearly symmetric around the origin appears not to have undergone any large-scale rearrangements in GPA, suggesting an adaptive advantage to maintaining orientation in the vicinity of the origin.

Based on the observed pattern, I subdivided the GPA pan-genome into zones depending on which gene class (HT, vertical, neither) was enriched in that region (Figure 4.2C). In this linear representation, I designated a symmetrically distributed span (“Far Origin”, grey colour in Figure 4.2C) in which neither HT nor vertical genes consistently dominated. However, before I continued to investigate the distinct distribution of HT and vertical zones with respect to their distance from the origin, I wanted to see to what extent these zones were coherent with classical regions of HGT enrichment - genomic islands.
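The link criterion and the replichore-balance filter described in this section can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work. It assumes that gene positions are expressed as fractions of genome length with the origin at 0, that distance is measured along the shorter arc of the circular chromosome, and that a rearrangement counts as symmetric when an orthologue lies near the mirror image (1 - position) of its partner. All function names and the `tol` parameter are hypothetical.

```python
def circular_distance(a: float, b: float) -> float:
    """Shortest arc between two positions on a circle of length 1."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def large_scale_links(pos_a: dict, pos_b: dict, min_dist: float = 0.05) -> list:
    """Return (family, pos_in_a, pos_in_b) for orthologues that moved by at
    least `min_dist` (5% of the normalised genome size) between two genomes."""
    links = []
    for family in sorted(pos_a.keys() & pos_b.keys()):
        if circular_distance(pos_a[family], pos_b[family]) >= min_dist:
            links.append((family, pos_a[family], pos_b[family]))
    return links


def is_symmetric(a: float, b: float, tol: float = 0.02) -> bool:
    """True if b lies within `tol` of the mirror image of a across the
    origin-terminus axis (assumed to be at position 1 - a)."""
    return circular_distance((1.0 - a) % 1.0, b) <= tol


def replichore_ratio(genome_length: int, terminus_pos: int) -> float:
    """Ratio of longer to shorter replichore, with the origin at coordinate 0."""
    right = terminus_pos
    left = genome_length - terminus_pos
    return max(left, right) / min(left, right)
```

Under these assumptions, a genome would be set aside when `replichore_ratio(...)` exceeds 1.2, the threshold quoted from Oliveira et al. (2017), and a link would be scored as a symmetric inversion when `is_symmetric(...)` returns `True`.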

4.3.3 HGT zones encompass genomic islands

As discussed in Chapter 1 (Section 1.2.4), GIs are generally identified as large (>10kb) regions of DNA that experience high levels of HGT-driven gene turnover and can often differ between strains of bacteria.

Figure 4.2: HGT-rich zones and origin-terminus symmetry. (A) Density profiles illustrating HT gene enrichment/depletion, as in Figure 4.1A, for 23 non-rearranged GPA genomes (see text). (B) Link plot showing the tendency of genes to invert symmetrically around the origin-terminus axis within the GPA clade. Each link (single line) between two orthologues is coloured red at one end and blue at the other. See main text and Methods (Section 4.2.3) for details. (C) HT and vertical gene density (data as in (A)) visualised on a linear plot. Zones, named at top, are defined symmetrically either side of the origin by the prevailing gene type. Note the Far Origin zone, where neither HT nor vertical genes are consistently enriched.

In particular, I was interested in whether GIs might be responsible for the HGT-rich zones adjacent to the origin of replication (Near Origin), given previous observations that pathogenicity islands are often found near the origin (Touchon, Bobay, and Rocha 2014; Westers et al. 2003) - presumably to take advantage of increased expression. In lieu of carrying out de novo GI detection, I made use of a publicly available database of precomputed GI locations - predictions were available for 17 of the 23 GPA genomes (Bertelli et al. 2017; see Methods, Section 4.2.4). I found that of the 6824 HT genes across the 17 genomes, only 1619 (24%) fell within predicted GIs (Figure 4.3A). However, when I analysed where the genomic islands lie across the GPA pan-genome, I found that 61% of GI genes lay within the HGT zones and a further 21% within the Far Origin zone that was not clearly enriched in either HT or vertical genes. In support of this observation, I found that while GI genes showed strong peaks in both the origin-proximal and terminus-proximal zones of HGT enrichment, their absence did not alter the overall pattern of HGT gene enrichment (Figure 4.3B). In other words, the broad HGT-rich zones encompass the majority of the islands, but it is not the GIs that delineate the zones; rather, the genomic islands find a haven, alongside non-GI genes, within the broader HGT-permissive zones. Interestingly, a higher proportion of the HT genes lying within GIs were derived from recent transfer events (33% compared to 15% in all HT genes across 17 GPA genomes; Figure 4.3C). Further, HT genes within GIs had a significantly higher AT content (Figure 4.3D), which is expected since distinct nucleotide composition is a common predictor used in the identification of GIs (Bertelli et al. 2017). As discussed in Chapter 1, GIs are often associated with the presence of boundary elements.
In line with this, I found that tRNA density was highest around GIs - as illustrated for G. kaustophilus in Figure 4.3E. These observations suggest a role for GIs as landing pads for new arrivals whose nucleotide content may restrict their integration elsewhere; over time, these may then disperse across the broader HGT-permissive zones.

Figure 4.3: HGT zones encompass genomic islands. (A) Representation across GPA genomes of HT genes inside and outside GIs (left) and in different genomic zones (right) as defined in Figure 4.2C. (B) Density profiles illustrating enrichment/depletion of HT genes, as in Figure 4.1A, within (left) and outside (right) GIs. (C) GIs are enriched for recent HGT events (right) compared to the genome-wide average (left). (D) GI HT genes have a lower average GC content than HT genes outside GIs. (E) Distribution of HT genes and other features along the G. kaustophilus genome. Zones correspond to those defined in Figure 4.2C.

So, while HGT-permissive zones encompass GIs, they are not defined by them. Instead, both the origin-proximal and terminus-proximal HGT zones appear to be mixed bags comprising GI genes, recently acquired HT genes (outside of GIs), and HT genes derived from ancient transfer events. To what extent, however, are these two locations (with respect to the origin) differentially structured? Is the Terminus zone a more relaxed ground for gene integration given its distance from the origin of replication and possible expression dosage effects?
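The overlap between HT genes and predicted islands reported in this section amounts to an interval-membership count. The sketch below is illustrative only: it assumes each gene is represented by a single coordinate (for example its midpoint) and each island by a closed, non-overlapping `(start, end)` interval on the same genome. The function name is hypothetical and the precomputed island database itself is not queried here.

```python
from bisect import bisect_right


def genes_in_islands(gene_positions, islands):
    """Count genes whose coordinate falls inside any island.

    gene_positions: iterable of integer coordinates (assumed gene midpoints).
    islands: list of (start, end) tuples, assumed non-overlapping.
    """
    islands = sorted(islands)
    starts = [start for start, _ in islands]
    count = 0
    for pos in gene_positions:
        # Index of the last island starting at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= islands[i][1]:
            count += 1
    return count


# Worked check of the proportion quoted in the text:
# 1619 of 6824 HT genes fall within predicted GIs.
share_in_gis = 100 * 1619 / 6824  # rounds to 24 (%)
```

Sorting the islands once and using binary search keeps the lookup cheap when thousands of genes are tested against a few dozen islands per genome.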

4.3.4 HGT zones are equally selective of incoming genes

The tendency for genes to be preferentially located on the leading strand is a well-established facet of bacterial genome organisation (see Chapter 1). Thus, changes in the strand bias of genes belonging to the HGT-permissive and restrictive zones could indicate the importance of conserved genome architecture.

I had a prior expectation that vertical genes should be predominantly positioned on the leading strand. Since these genes had persisted in the genome since at least the last common ancestor of the GPA clade, selection would have had ample time to act on their orientation to minimise disruption caused by conflicting directions of replication and transcription. Accordingly, I found that vertical genes in vertical zones (Origin and Flank) were much more likely to be on the leading strand (Figure 4.4A). Conversely, leading strand preference was reduced, but not abolished, for vertical genes in non-vertical zones (Near Origin, Far Origin, Terminus). So, leading strand bias is greatest in zones dominated by vertical genes, even if they are further removed from the origin. This observation is consistent with the fact that it is gene essentiality (essential genes will be vertical genes) rather than expressiveness that is highly associated with leading strand preference (Rocha and Danchin 2003b).

Surprisingly, genes derived from old HGT events and, critically, recent HT genes also showed a leading strand preference of similar magnitude to that of vertical genes (Figure 4.4A). Assuming that initial integration should be equally likely to occur on either strand, an even distribution of recent HT genes amongst the leading and lagging strands would be expected in the absence of selection. My observation to the contrary indicates the presence of a rapidly enforced selection barrier for gene orientation at the point of, or immediately following, integration. It is interesting to note that while leading strand preference is lower for both vertical and old HT genes outside the vertical-rich zones, recent HGTs continue to track the reduced bias closely. This observation suggests that even outside the zones harbouring conserved, likely essential, vertical genes, local expression architecture is still sensitive to disruption. Furthermore, if strand bias is used as a measure of zone organisation, there is no obvious distinction between the Near Origin and Terminus HGT-permissive zones.

In contrast to gene orientation, gene nucleotide content does not appear to track the HGT-permissive and restrictive zones in a patterned way. Again applying a pan-genome approach, I find that the average nucleotide content of vertical genes remains fairly consistent over the chromosome (Figure 4.4B). There is a clear AT-richness at the immediate origin, presumably to aid with DNA melting during replication initiation, and a shallow decrease in GC content around the terminus. However, the GC content of vertical genes in the HGT-rich Near Origin zone is not depressed, suggesting that nucleotide composition of core genes, unlike their orientation, is not patterned by zone. By contrast, HT genes show a much more variable GC content, albeit one that is on average lower than that of vertical genes at the same genomic locations (Figure 4.4B). However, this variability does not appear to correlate with HGT-rich and HGT-poor zones.
Unlike for vertical genes, the GC content of HT genes at a given location is often quite variable in the underlying genomes (fine green lines, Figure 4.4B). Thus, any true local bias in nucleotide content, e.g. caused by the presence of GIs, might in fact be obscured in the pan-genome analysis.

Figure 4.4: Selection on strand and GC content across HGT zones. (A) Relative leading/lagging strand enrichment of vertical and HT (old and recent) genes across 23 length-normalised GPA chromosomes. (B) Changes in normalised GC content of HT and vertical genes across GPA chromosomes. Thick lines are aggregates over 23 genomes; fine lines represent individual genomes. (C) The GC content of old but not recent HT genes correlates with the GC content of their vertical neighbours. For calculation details of all panels see Methods (Section 4.2.5).

To further test whether there was any evidence for local nucleotide content as a barrier to gene integration, I investigated whether the GC content of HT genes tracked the GC content of their vertical neighbours. For old HT genes, I found that the HT genes indeed conformed to the local GC content (Figure 4.4C). That is, more GC-rich vertical genes are surrounded by more GC-rich HT gene neighbours. However, I found no such pattern for recent HT genes - their nucleotide content appears agnostic to that of surrounding vertical genes. Thus, I see no evidence of strong initial selection for nucleotide content - but acquired genes do undergo amelioration to their local context over evolutionary time.

In summary, consistent with the tendency of essential genes to be on the leading strand, I find that leading strand bias is greatest in zones dominated by vertical genes, even if they are further removed from the origin. Nevertheless, genes in HGT-rich zones are still predominantly on the leading strand, and the observation that recent HT genes match the local strand preference suggests strong, immediate selection on gene orientation at the time of integration. I see no evidence of such a filter for gene nucleotide content, as incoming HT genes seem agnostic to local nucleotide context. Yet, over a longer time period, I find that HT genes ameliorate to match the local neighbourhood. Interestingly, I find no significant evidence to suggest that either the Near Origin or the Terminus zone is more relaxed than the other with respect to its local expression architecture.
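For readers who wish to reproduce the strand analysis, leading-strand assignment can be sketched as below. The sketch assumes genome coordinates that start at the origin of replication, a known terminus coordinate, and genes given as `(start, strand)` tuples. Under that convention a gene is co-oriented with the replication fork if it lies on the forward strand of the right replichore or on the reverse strand of the left replichore. The function names are hypothetical and the calculation described in Methods (Section 4.2.5) may differ in detail.

```python
def on_leading_strand(start: int, strand: str, terminus_pos: int) -> bool:
    """Return True if a gene is transcribed co-directionally with replication.

    Assumes the origin sits at coordinate 0, the right replichore spans
    [0, terminus_pos) and the left replichore spans the remainder.
    """
    on_right_replichore = start < terminus_pos
    if on_right_replichore:
        return strand == "+"
    return strand == "-"


def leading_fraction(genes, terminus_pos: int) -> float:
    """Fraction of genes on the leading strand; genes is a list of (start, strand)."""
    flags = [on_leading_strand(start, strand, terminus_pos) for start, strand in genes]
    return sum(flags) / len(flags)
```

Computing `leading_fraction` separately for vertical, old HT and recent HT genes within each zone would give the kind of per-class comparison summarised in Figure 4.4A; a value of 0.5 is the expectation if integration were strand-neutral and unselected.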

4.3.5 HGT zones are patterned by gene function

I next looked at whether HT gene presence in the origin-proximal or terminus-proximal HGT zone was patterned by the function of the gene product. To facilitate analysis and focus specifically on gene distribution between origin and terminus, here I combined gene positions across the two replichores given their extensive symmetry (see above).
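Combining the two replichores amounts to folding each normalised gene position onto its distance from the origin. A minimal sketch, assuming positions are fractions of genome length with the origin at 0 and the terminus at 0.5 (that is, balanced replichores, as for the 23 retained genomes); the function name is hypothetical:

```python
def fold_position(position: float) -> float:
    """Map a normalised circular position to its distance from the origin.

    Returns a value between 0 (origin) and 1 (terminus), so that genes at
    mirrored positions on the left and right replichores coincide.
    """
    position = position % 1.0
    return 2.0 * min(position, 1.0 - position)
```

For example, genes at 0.25 and 0.75 both map to 0.5, half way along their respective replichores.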


Considering product function in terms of COG categories, I found significant differences in the gene function enrichments of the zones. First, while the Origin zone is highly depleted for HT genes overall, the few translation genes (COG J) that have been transferred into GPA are found there (Figure 4.5A). This is in line with a previous observation that the position of genes involved in translation and transcription is highly constrained by gene dosage effects in fast-growing bacteria (Couturier and Rocha 2006).

Second, I found that the Terminus HGT-permissive zone is enriched for genes encoding metabolic functions. These include carbohydrate metabolism (COG G), energy production and conversion (COG C), amino acid metabolism (COG E), inorganic ion transport and metabolism (COG P) and secondary metabolite biosynthesis, transport and catabolism (COG Q, Figure 4.5A). The post-translational modification, protein turnover, and chaperones cluster (COG O) is also highly enriched in the Terminus zone. High incidence of metabolic HT genes near the terminus is consistent with a model of genome compartmentalisation in which genes required for fast growth at optimal conditions are origin-proximal, to take advantage of gene dosage effects, while those expressed during metabolic challenge are relegated towards the terminus.

To support this model, I analysed patterns of gene expression in different media. As there was no publicly available large-scale expression data for any of the GPA species at the time of this analysis, my supervisor Tobias Warnecke pointed me to a publicly available dataset of B. subtilis gene expression (Borkowski et al. 2016). In the study, the authors performed whole genome gene expression analysis in four different media ranging from rich (CHG media) to poor (M9SE). However, before analysing gene expression, I had to ensure that B. subtilis genome architecture was comparable to that of GPA. To this end, I identified all orthologues of vertical genes between a single GPA species, G. kaustophilus, and B. subtilis str. 168 (N = 893 genes), and plotted their relative positions (Figure 4.5B, top). I found that the relative position of most vertical genes was remarkably well conserved between G. kaustophilus and B. subtilis, especially with respect to the genes in the Flank zones. Interestingly, the relative size (as a proportion of genome size) of the putative Terminus zone in B. subtilis was significantly reduced compared to G. kaustophilus. This enlarged Terminus zone in GPA could be adaptive in nature; GPA species inhabit a broad range of environments, with an almost global distribution, and a broader metabolic repertoire could be maintained to tackle the diverse ecology these organisms encounter.

Given the synteny in genome architecture between G. kaustophilus and B. subtilis, I next looked at B. subtilis gene transcript levels across the chromosome (Figure 4.5B, bottom). My first observation was that the highest expression levels across all media were associated with regions harbouring predominantly vertical genes. Interestingly, I found that gene expression around the terminus in rich media (CHG) was the lowest across the genome, but on transition to poorer media (S, M9SE) expression of terminus-proximal genes increased substantially in a tiered manner, with the poorest media resulting in the highest expression.

Overall, these observations highlight two things. First, increased expression in reduced media and high enrichment of metabolic genes at the terminus together suggest a role in condition-specific activation of metabolic genes that are not necessary in optimal conditions. Second, gene expression patterns in these Bacillaceae species are not solely determined by fast growth - genes at the immediate origin are indeed highly expressed, but there is no clear gradient of gene expression running from origin to terminus. Instead, gene expression is highest for the conserved, likely essential vertical genes regardless of their genomic position, with the rest of the genome dominated by modular-looking expression domains.
In the Near Origin zone I find an enrichment for genes encoding products involved in signal transduction (COG T), defence (COG V), and transcription (COG K).

Figure 4.5: Spatio-functional patterns of gene flow into the GPA clade. (A) Global topological patterns of HGT across GPA chromosomes (top) are decomposed by COG functional class (bottom). Left and right replichores have been combined for simplicity. High HGT densities for COGs I, H, and J are associated with relatively few genes, thus contributing little to overall HGT density. COG categories were clustered hierarchically based on the proportion of HGTs in the respective COG found in each zone. See Methods (Section 4.2.6) for further details. (B) GPA-vertical genes in G. kaustophilus, representing the GPA clade, share high synteny with their orthologues in B. subtilis str. 168. Regions of high vertical gene density in B. subtilis, particularly in areas corresponding to GPA-defined Origin and Flank zones, are associated with higher average expression. Note the upregulation of expression in poorer media (lighter blue) evident around the terminus and elsewhere. Thick lines represent regression fits between replicate-averaged expression values and gene start positions. See Methods (Section 4.2.7) for further details.

The most pronounced enrichment, however, was evident in genes comprising the cell wall/membrane biogenesis functional category (COG M); 49.7% of all HT genes within this category are found in this region, which corresponds to just 17% of the genome. In line with the assumption that genes of unknown function (COG S) and genes with no identifiable orthologues (COG ∇) will in reality include multiple functions, such genes are prevalent and broadly distributed across both HGT-enriched zones.

Curious whether genes encoding particular functions might delineate the sharp transitions between HGT-rich and HGT-poor zones along the GPA chromosome, I performed an exploratory analysis at the zone boundaries established in Figure 4.2C. Specifically, I looked at genes in the regions immediately upstream and downstream of all boundaries (0.75% of the normalised genome size either side of the boundary). Here, I did not find any strong differential patterning of gene product function. For example, while 8% of boundary genes belonged to the COG M category, compared to only 4% of all GPA genes, such an increase is not indicative of a global pattern that might define transitions between zones.

In summary, I find that HT gene position in either the Near Origin or Terminus zone is strongly patterned by the function of the gene product. Metabolic genes are strongly enriched towards the terminus, where their expression is likely regulated in response to nutrient conditions. Notably, expression along the chromosome is not entirely dependent on the replication-associated dosage effect observed for fast-growing bacteria - rather, there appears to be a sequence of modular expression domains. This then raises the question: if the Near Origin genes are not benefiting from elevated expression by their proximity to the origin, why are they there? Given the enrichment of genes involved in membrane biogenesis and transcription within this zone, I turned to a different facet of the GPA lifestyle - one responsible for the global ubiquity of GPA sampling: spore formation.
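The proportions quoted in this section can be read as simple fold enrichments relative to a uniform expectation. The sketch below is only an illustration of that arithmetic, together with a helper for selecting genes within the 0.75% window either side of a zone boundary. The function names and the example boundary positions are hypothetical, and positions are assumed to be fractions of the normalised genome size.

```python
def fold_enrichment(share_of_genes: float, share_of_genome: float) -> float:
    """Observed share of genes in a region divided by the share expected
    if genes were spread uniformly along the chromosome."""
    return share_of_genes / share_of_genome


def near_boundary(position: float, boundaries, half_width: float = 0.0075) -> bool:
    """True if a normalised position lies within `half_width` of any boundary."""
    return any(abs(position - b) <= half_width for b in boundaries)


# COG M HT genes: 49.7% of the category sits in a region covering 17% of the genome.
cog_m_near_origin = fold_enrichment(0.497, 0.17)  # close to a threefold excess

# Boundary regions: 8% COG M among boundary genes versus 4% genome-wide.
cog_m_at_boundaries = fold_enrichment(0.08, 0.04)  # a twofold difference
```

Seen this way, the Near Origin concentration of COG M genes is the larger of the two effects, and it is concentrated in one zone rather than spread across all boundaries.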


4.3.6 The Near Origin zone and sporulation

The initial stages of endospore formation in B. subtilis involve the construction of a septum that asymmetrically partitions the cell, trapping approximately one third of a single daughter chromosome within the nascent prespore (Figure 4.6A; also see Chapter 1, Section 1.2.4). This physical compartmentalisation necessitates, for a time, an expression program that is specific to the forespore and orchestrated by the forespore-specific regulator, σF (Wang et al. 2006b). The σF regulon includes a number of genes involved in further transcription regulation (e.g. yabT and rsfA) and others that direct spore morphogenesis or are localised to the membrane (e.g. spoIIQ, tuaF, ywnJ, ywhE, and cydD).

Wang and colleagues (2006b) made two critical observations regarding σF-regulated gene positions in B. subtilis. First, they noted that the majority of genes were positioned on the left replichore, though they acknowledged that the source of this preference was unclear. Second, they identified a cluster of genes within a ~600kb window, asymmetric around the origin of replication, that overlaps with RacA binding sites, which are essential for anchoring the origin region to the cell pole. These genes will be trapped in the forespore immediately following septation.

In the first instance, I wanted to compare the positions of genes involved in B. subtilis sporulation with their orthologues in GPA. Of course, looking only at conserved genes would omit forespore-specific adaptation in GPA. However, consistency in the conserved genes contributing to sporulation would suggest a consistent mechanism of sporulation and allow further parallels to be drawn. To achieve this, I found orthologues of all B. subtilis σF-regulated genes (as listed in Wang et al. 2006b) in the 23 GPA genomes. Of the 48 B. subtilis genes, I found that 43 had orthologues in GPA genomes and, of these, 31 had a single copy in each GPA genome - suggesting some degree of plasticity in the early forespore program. I then compared the positions of these 1-to-1 orthologues and found remarkable consistency between the prior observations in B. subtilis and my own in GPA (Figure 4.6B).

Figure 4.6: Forespore development genes positionally conserved in GPA. (A) Schematic representation of HT and vertical zones in the forespore and mother cell during early spore formation. (B) Positions of σF-regulon genes shared between B. subtilis and all 23 GPA genomes are highly conserved. Gene locations along the chromosome are given by red points (B. subtilis) and vertical black lines (GPA genomes). Gene names are given on the y-axes. Note, gene positions are plotted from origin to origin (left to right) clockwise, hence the “right” and “left” replichore labels.

The tendency of these genes to be positioned on the left replichore is similarly apparent in the GPA species, though replichore choice is not a strict constraint across the majority of the genome. Beyond the spoIIQ gene, σF regulon genes lie on either replichore - very likely the result of symmetric rearrangements around the origin, given the conservation of their distance with respect to the origin. In contrast, genes near the origin, from csfB to spoIIQ, appear on a single replichore consistently in all GPA species (Figure 4.6B). This strikingly coincides with both the cluster of genes identified as overlapping with RacA sites and the edge of the Origin zone I defined above. These genes lie in an area that is prohibitive to even symmetric rearrangements, as can be seen in Figure 4.2B, as rearrangement of the asymmetrically distributed RacA sites may negatively impact the sporulation program (Ben-Yehuda et al. 2005). In summary, genes acting during the early sporulation program appear spatially conserved between B. subtilis and GPA species, which indicates that similar mechanisms and constraints are likely to act on the GPA chromosome as a result of endospore formation.

To further investigate the interplay between spore formation, gene function, and the role of HGT, I compared the locations of genes belonging to a particular COG category in other spore-forming and non-spore-forming Bacilli (see Methods, Section 4.2.8). If the enrichment of the COG M genes in the Near Origin zone in GPA is driven by physical constraints imposed by sporulation, I reasoned that the orthologues of these genes should show a similar positional bias in sporulating organisms, but not in those without a spore-forming program.
To do this, I divided the HGT-derived COG M genes into two sets - the first corresponding to those within the first 38% of the chromosome (coinciding with a general dip in gene density) and the other including the COG M genes from the remaining 62% of the chromosome (Figure 4.7A, top panel). Looking at the distribution of COG M orthologues from the first 38%, there is a pronounced enrichment of these genes towards the origin in spore-forming bacteria (Figure 4.7A, lower left panel), peaking in the HGT-rich Near Origin zone, while no such patterning is seen in non-spore-forming species. COG M orthologues from the remaining 62% of the chromosome did not show any differential bias in position between spore-forming and non-spore-forming genomes, though genes were depleted towards the origin in both cases (Figure 4.7A, lower right panel). I observed a similar, albeit weaker, trend for orthologues of COG K and T HT genes (Figure 4.7B). Based on my observations, I suggest that the asymmetric compartmentalisation of one chromosome during the early stages of forespore formation has topologically patterned HGT into GPA species, and likely other spore-forming Bacilli, over evolutionary time.
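The partition of COG M genes at the 38% mark, and the comparison of orthologue positions between spore-formers and non-spore-formers, can be sketched as follows. This is an assumption-laden illustration: gene positions are taken to be folded distances from the origin (0 at the origin, 1 at the terminus, replichores combined), the 38% boundary is treated as inclusive, and the function names are hypothetical.

```python
def split_by_origin_distance(distances: dict, cutoff: float = 0.38):
    """Split gene IDs into origin-proximal and terminus-proximal sets.

    distances: gene ID -> folded distance from the origin (0..1).
    """
    proximal = sorted(g for g, d in distances.items() if d <= cutoff)
    distal = sorted(g for g, d in distances.items() if d > cutoff)
    return proximal, distal


def fraction_within(orthologue_distances, max_distance: float) -> float:
    """Share of orthologue positions lying within `max_distance` of the origin."""
    values = list(orthologue_distances)
    return sum(d <= max_distance for d in values) / len(values)
```

Comparing `fraction_within` for the orthologues of the proximal set in spore-forming and in non-spore-forming genomes gives a single-number summary of the contrast drawn in Figure 4.7A; the density profiles in the figure carry more information than this summary.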

4.4 Discussion

In this chapter I have investigated the topological biases shaping gene gain by HGT into the GPA clade. First, I showed that HT and vertical genes show a distinct enrichment pattern around the GPA genome; notably, there are two HGT-rich spans on either side of the origin of replication (which is depleted of HT genes), and a broad HGT-rich region centred on the terminus. This pattern is consistent across different HT prediction thresholds (including the most relaxed), and is also evident, albeit weaker, for gene gain from phylogenetically close donors. Although the clarity of the observed pattern was certainly enhanced by inclusion of multiple filtering steps, I believe this topological bias should be detectable using any phylogenetic approach that can differentiate between putative HGT-derived and vertically-inherited genes.

Next, I determined that the source of the origin-terminus symmetry observed in HT and vertical gene topology was an abundance of symmetrical inversions around the origin-terminus axis, rather than a batch effect of combining data across GPA spp. This was in line with previous work suggesting that symmetrical inversions were the most common mode of rearrangement in Bacillaceae species (Repar and Warnecke 2017).

Figure 4.7: HT gene topology may be driven by spore formation. (A) The top panel shows the relative density of vertical (yellow) and HT (green) COG M genes along GPA chromosomes. Left and right replichores have been combined for simplicity. The two lower panels show the topological distribution of orthologues of GPA COG M genes in spore-forming (red) and non-spore-forming (grey) Bacilli. Left: orthologues of those GPA genes that are located on the origin-proximal 38% of the chromosome. Right: orthologues of those GPA genes that are located in the terminus-proximal 62%. (B) Topological distribution of orthologues of GPA COG K (upper panels) and COG T (lower panels) in spore- and non-spore-forming Bacilli. Boundaries and colours as defined in (A).

It was curious to see that a span centred on the origin of replication was devoid of symmetrical rearrangements - presumably gene synteny, or binding site density (as highlighted in Section 4.3.6), is critical to fitness in this zone. Nevertheless, this analysis step clarified that while I detected three “spans” of HGT-enrichment across the GPA pan-chromosome, the two on either side of the origin corresponded to the same zone. Thus, I defined two zones of HGT-enrichment, differentiated by their distance from the origin - the Near Origin and Terminus zones. At this stage, I also found that two of the 25 GPA genomes in this study had experienced significant asymmetric rearrangement of their terminal region. At least in the case of A. amylolyticus, the isolate was successfully subjected to growth and expression analysis (Poli et al. 2006), raising the question of whether such significant rearrangements are real, and the replichore length imbalance is tolerated, or whether this is an artefact of lab domestication or assembly. Either way, the two genomes were omitted from further analysis.

To determine how HGT-permissive zones overlapped with previously defined regions of high gene turnover, I compared zone boundaries to publicly available GI locations. I found that my zones of HGT enrichment encompassed the majority of GIs but were not defined by them - rather, GIs are components of broader regions with affinity for gene gain by HGT. Although this stage of the analysis focused on GIs, these are usually predicted by evaluating compositional metrics and boundary elements, and so those used in this analysis are not necessarily representative of regions undergoing increased gene turnover. How, though, do the zones detected in this work overlap with hotspots of gene turnover predicted by orthogonal approaches?
Although the analysis of hotspot location with respect to distance from the origin by Oliveira et al. (2017) was aggregated over multiple taxa (including spore- and non-spore-formers), the observed distribution of hotspots lacking prophages and integrative elements (enrichment in what corresponds to the Near Origin and Terminus zones) was remarkably consistent with the one observed in this work (see Figure 6a, Oliveira et al. 2017). In light of this, it would be interesting to see to what extent the topology of these hotspots is driven by the inclusion of spore-forming species, or Firmicutes more generally, in their analysis.

Next, I wanted to see whether the Near Origin and Terminus zones of HGT enrichment were equally permissive to gene integration. The fact that recent transfers, and GIs, are represented in both zones indicates that both zones do accept new HGTs; however, I wanted to explicitly assess whether barriers to integration differed across the genome. I determined that HT gene strand was under rapid selection following integration, and that this was most strongly enforced for HT genes coming into zones enriched for vertical genes. This suggests that local architecture is highly structured in the Origin and Flank zones and that the rare HGT integrants need to be minimally disruptive. In the Near Origin and Terminus zones the requirement to be on the leading strand is also present, though more relaxed. Importantly, I see no clear difference between the zones, suggesting local expression architecture is equally (un)constrained in both. In terms of nucleotide content, the barrier to integration appears low, with recent HT genes not conforming to the GC content of their vertical neighbours across the genome. Here, it would be interesting to delve deeper to see whether a nucleotide composition preference does exist for the rare new arrivals into vertical-rich zones, but is obscured by the greater number of recent HT genes flowing into HGT-permissive zones. However, the number of such cases in this dataset is likely too low to draw meaningful conclusions. Nevertheless, this analysis suggests that the flow of genes closer to the origin or the terminus is not driven by differential barriers to integration. Of course, composition of genes in either zone could be interrogated further (e.g.
codon bias preference, gene length, etc.), but given the strong impact of gene function on zone preference I felt it unlikely that there would be any significant differences. Finally, I found that the HGT-rich zones were strongly patterned by the function of the genes. The Terminus zone stood out with a prominent enrichment for

genes encoding metabolic functions. The function of the zone as a harbour for genes to be expressed under metabolic challenge is supported by the expression profile of the orthologous zone in B. subtilis. Thus, genes that are expressed on demand are preferentially positioned away from the origin which may, in turn, suggest that HT genes that would benefit from high expression under optimal conditions (replication-associated gene dosage increase) would then be present in the Near Origin region. Surprisingly, however, even in the richest media B. subtilis genes in the orthologous Near Origin zones are not expressed at the high level that might be naively expected from their proximity to the origin. Instead, highest expression is associated with the highly structured vertical-rich regions distal from the origin. This stage of the analysis would of course benefit from GPA spp. expression data - even though genome architecture appears constrained between GPA and B. subtilis, the relative sizes of the zones are different, suggesting some differential genome organisation.

What, then, is the purpose of a Near Origin HGT-rich zone that is not taking advantage of elevated expression, i.e. why are the locations of the Flank and Near Origin zones not reversed? The enrichment of HT genes involved in membrane biogenesis in the Near Origin zone suggested a possible role in spore development. Given the known physical compartmentalisation of the chromosome during sporulation, I investigated whether spore-formation may be responsible for the observed HGT topology. Similar restriction of membrane-related HT genes in other spore-forming bacteria (but absent in non-spore-formers) suggested that spore formation may indeed guide genome organisation. Such a topological arrangement makes sense, because genes that contribute to early forespore development will have to be expressed from the first third of the chromosome.
However, available space immediately adjacent to the origin is limited and is mostly occupied by genes involved in core replicative functions and those that benefit from replication-associated dosage increase. Thus, conserved genes involved in σF-regulated forespore development cluster in the near-origin zone


(Wang et al. 2006b). Innovation of spore morphogenesis by HGT suffers from the same space limitations: incoming genes avoid integration at the immediate origin but still need to be in that prized first third of the chromosome trapped by septation, and this may be particularly key for membrane proteins that are vital for endospore formation (Kim and Schumann 2009). Membrane synthesis is a continuous process throughout early spore formation, including during the period following septation when contribution from the mother cell is limited (Lopez-Garrido et al. 2018), so local expression may be critical for some genes. In addition, membrane proteins can rely on co-translational folding and concomitant insertion into the membrane which would also require compartment-specific expression. Comparative genomic studies suggest the importance of HGT for sporulation innovation: although the commitment steps to sporulation (initiation, septation) are highly conserved across the Firmicute phylum, this is not the case for later forespore and mother cell development (De Hoon, Eichenberger, and Vitkup 2010; Galperin et al. 2012). Thus, I suggest that the ability to evolve the sporulation program depends on an HGT-permissive region near the origin that would allow for compartment-specific expression during spore formation. If such a permissive zone is established, gene flow into this region would be open to other genes, some performing non-spore-related functions. However, given the on-going constraint on zone size, one would expect spore-specific functions to distil in the Near Origin zone with HT genes encoding other functions to cluster outside of this prime real estate.

Chapter 5

Discussion


In this thesis I have presented the first comprehensive analysis of gene flow into the GPA clade, with a specific focus on spatial biases that shape HGT arrival and distribution within the genomes of these microbes. In Chapter 2, I employed a phylogenetic approach to robustly identify both a set of HGT events that putatively contributed to GPA lineage evolution, and vertical genes that had been inherited by GPA species from their last common ancestor with other Bacillaceae (and have not been affected by HGT within the clade). By augmenting canonical steps required for phylogeny-based HGT inference to benefit a clade-specific approach, I significantly increased the overall speed of the analysis without sacrificing accuracy. In Chapter 3, I considered my inferred HT and vertical genes in the context of previously described patterns that typically distinguish HT and non-HT genes, and found that my predictions were in line with expectations. I also determined the timing of gene acquisition into GPA which revealed a broadly continuous, rather than punctuated, gene flow throughout lineage evolution. In Chapter 4, I analysed the spatial distribution of HT and vertical genes within the GPA genomes to find a distinct pattern - with HT and vertical genes preferentially occupying separate zones. Of the gene characteristics investigated, it was gene product function that differentiated the origin-proximal and terminus-proximal HGT-permissive zones. From this observation, I identified a likely link between the location of HT genes and endospore formation, a crucial developmental program that is regularly utilised by GPA species.

In this chapter I summarise the main results of this thesis, discuss their limitations and suggest improvements, and comment on their implications for the broader field.
I go on to briefly comment on my findings in the context of genome engineering and the relative importance of gene characteristics and target selection in genetic modification attempts. I conclude by considering future work that would further support and clarify the interplay between gene flow, genome organisation, and developmental programs.


5.1 Clade-specific HGT inference

In the first analysis chapter my aim was to apply a method to conservatively infer a set of HGT events that contributed to the evolution of the GPA clade. The ability of phylogeny-based methods to identify ancient transfer events, even where gene signatures (such as nucleotide content) have been masked by millions of years of evolution in the host, made them the preferred choice over compositional methods. Within that, I opted for a reconciliation approach to discern putative HGT events in the comparison of gene and species trees; reconciliation not only considers tree discordance in the context of alternative evolutionary events, but also allowed me to modulate detection stringency. Faced with the high computational cost typically associated with phylogeny-based methods, I exploited my specific objective to detect gene flow into GPA to reduce required time and resources while maintaining the breadth of input genomes to maximise the chances of identifying orthology.

The aim of my thesis was to empirically explore gene flow in the context of genome architecture, and so the methods were developed in an effort to enable downstream analysis rather than optimise workflows. As a result, I carried out little benchmarking of the methods. So, while the principles developed here seem reasonable, as does the detected HGT output, benchmarking would be required to endorse the methods to the broader field.

I achieved increased computational tractability by considering what is necessary to differentiate between a horizontal transfer and a history of vertical inheritance patterned by loss: context. To determine the source of a gene in a given genome, one needs to look at which genomes possess genes with the highest similarity - if these are distant relatives, HGT into the genome of interest is a likely explanation.
If there is sufficient phylogenetic context to determine the origin of a GPA gene, the presence of extra low-homology orthologues adds little further value yet increases computational cost. By applying this rationale, I managed to reduce the size of some of the largest gene families significantly. This resulted in an overall reduction of sequences requiring phylogenetic reconstruction by 42% without compromising the number of gene families or GPA genes included. Next, I reduced computational cost by using a less accurate tree reconstruction algorithm - a choice justified by comparing HGT inference output from a previous dataset, revealing insignificant downstream differences after using more and less robust tree building approaches. Focusing on gene flow into GPA also allowed me to make savings in reconstructing the reference tree. Using the gene trees, I applied a coalescence-based approach to produce a phylogeny that was as expansive as the number of taxa in the underlying gene trees. The accuracy in the local phylogeny of the taxa closely related to GPA was critical to avoid spurious HGT inference into GPA, so I employed a concatenation-based method to correct for any error - a step that turned out to be largely unnecessary. Applying the reconciliation approach at four detection thresholds produced four sets of HGT and vertical gene predictions of increasing stringency. These putative HGT events were further filtered for consistency across the thresholds, which included the removal of HGT events predicted to have been gained from phylogenetically close taxa (sHGTs). Finally, I coarsely partitioned the HGT events by age.

A constant struggle in modern computational biology is to improve both the accuracy and speed of calculation(s). Unfortunately, these two aims are often antagonistic. In light of this, my work in Chapter 2 is a significant step forward for detecting HGT into a particular group which can be useful for both probing the evolutionary history of genomes of interest, and for producing non-aggregate patterns of gene flow that can inform targeted synthetic biology efforts. It is difficult to estimate the increase in overall pipeline speed relative to one that did not take the shortcuts afforded by inferring HGT into a particular group.
However, anecdotal evidence from a previous dataset (see Chapter 2, Section 2.3.1) showed an overall runtime that was reduced from months to weeks - though a significant contribution to that was choosing a faster method to reconstruct phylogenies. The scale of the

speed-up could be assessed by selecting a number of gene families and running the pipeline with and without my augmentations and then comparing overall running time. Critically, this would also show the robustness of downstream HGT inference to reducing gene family size - a key constant that needs to be maintained for the reduction approach to be a valuable tool. However, there are confounding factors. As mentioned in Chapter 2, reducing gene family size, i.e. removing low homology sequences, could in fact improve alignment quality and alter the resulting phylogeny in unpredictable ways - which in turn would affect the reconciliation-derived gene history. Another obvious limitation is the selection of an arbitrary cutoff for context - in my case this was 200 total sequences that had to include all GPA sequences present at the lowest threshold. I did not iterate over different cutoff values and, given more time, this would be a useful experiment to determine how downstream HGT inference and computation time are affected by varying this parameter. Further, more genomes in the group of interest would necessitate a larger context, e.g. if the GPA comprised 70 genomes, a minimum context cutoff of 200 would leave just 130 genomes to infer gene flow from outside the GPA. Thus, iteration might reveal a good rule of thumb by which to scale required context for groups of interest of varying size.

Finally, given more time I would have liked to improve my partition of HGTs by age of acquisition. While my approach is effective for those genomes that share a close relative within the GPA, i.e. belong to the same genomospecies, it left a number of genomes for which recent HGTs could not be detected (and so were included in the old HGT set). Here HGT detection with a complementary compositional method could have been effective - all genes detected by such an approach, with an appropriate threshold, could be considered as recent transfers.
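The context cutoff discussed above can be illustrated with a minimal Python sketch. It trims a gene family to a fixed number of sequences while always retaining every sequence from the group of interest. The function name, the tuple layout and the use of a single similarity score per sequence are assumptions made for illustration; this is not the implementation used in Chapter 2.

```python
def reduce_family(seqs, cutoff=200):
    """Trim a gene family to `cutoff` sequences, keeping all in-group sequences.

    seqs: list of (seq_id, in_group, similarity) tuples (hypothetical layout),
    where `similarity` is the best score against any in-group sequence.
    """
    in_group = [s for s in seqs if s[1]]
    context = [s for s in seqs if not s[1]]
    # Slots left for out-group context once every in-group sequence is kept.
    n_context = max(cutoff - len(in_group), 0)
    # Retain the most similar context sequences; low-homology ones are dropped.
    context.sort(key=lambda s: s[2], reverse=True)
    return in_group + context[:n_context]
```

With 70 in-group sequences and a cutoff of 200, only 130 slots remain for context, which is the scaling concern raised above; iterating over `cutoff` would be one way to explore a rule of thumb.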
Overall, however, the efficacy of my approach must be considered with respect to the quality of HT and vertical genes predicted. In Chapters 3 and 4 I find robust patterns associated with these gene classes. For patterns that had been previously observed, I find high coincidence between my data and prior works

suggesting good, if stringent, HGT inference. As such, the augmentations to the phylogeny-based approach I presented in Chapter 2 could become a valuable addition to the HGT detection toolkit, if appropriate benchmarking were undertaken.

5.2 Assessment of HGT into GPA

In the second analysis chapter of the thesis, my aim was to evaluate the HGT events predicted in the previous chapter. There are many patterns associated with both HGT-derived and more highly conserved, vertically-inherited genes across the prokaryotic tree of life (see Chapter 1, Section 1.2). These allowed me to assess the quality of my predictions by comparison. Essentially, if my HT and vertical genes displayed biases that were a priori expected, based on previous studies of genes of such heritage, then they might be considered suitable for further analysis. Of course, similarity in observed patterns does not confer certainty in the fidelity of prediction. However, in lieu of using orthogonal prediction methods to get a ‘second opinion’, this approach would give a first indication of HGT inference quality. I turned to the patterns that most robustly characterise HT genes - the function of the protein product and gene nucleotide composition. If patterns were observed, my next aim was to see whether I could distinguish any differences between HGT predictions made at different stringency thresholds. This would allow me to find a balance between the perceived quality of the predicted HGT events and their number. Finally, I also wanted to evaluate gene flow into GPA in light of less ubiquitous patterns associated with HGT: the timing of gene gain over lineage evolution, and the most frequent sources of these genes.

I first investigated what product functions were enriched in my HGT-derived genes versus the predicted vertically-inherited genes. I found that, in line with previous works, HT genes were enriched for defence (COG V) and carbohydrate metabolism (COG G) functions, amongst others, and strongly depleted for

genes encoding proteins involved in translation (COG J). Further, this encouraging pattern was consistent across all stringency thresholds - even evident at biologically unreasonable thresholds, queried with a prior dataset. Exclusion of sHGTs in the filtering steps in Chapter 2 was justified here based on my observation that the pattern of functional enrichment was much more dilute in the sHGT set. This could be due to transfer of genes with higher product connectivity being better tolerated if coming from similar genomes, or greater probability of homologous replacement due to shared boundary sequences. However, I could also not rule out the contamination of the sHGT dataset with vertical genes due to phylogenetic error.

Next, I looked at the nucleotide composition of putative HT genes and, again in line with prior works, found that HT genes were typically more AT-rich than vertical genes in the same genome. This trend was observed in all genomes, despite a variation in mean genome GC content of 10% within the GPA. The AT-richness was greater for recent HGTs, suggesting that, over time, HT genes ameliorate to the host genome nucleotide content. Again, the observed patterns were consistent across all stringency thresholds. My observation of AT-biased gene gain in GPA adds to a number of reports highlighting general GC-poverty of transferred genes relative to genomic GC content in select groups (see Chapter 1), yet a comprehensive and systematic assessment of comparative AT-richness of HT genes has not yet been undertaken. It would be interesting to see to what extent this bias is universal across microbial HGT, which in turn may lead to a more informed hypothesis on the forces driving this phenomenon.
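As a minimal sketch of the within-genome comparison described above, the following Python functions compute per-gene GC content and the difference in mean GC between HT and vertical genes from one genome. The function names and the choice to ignore ambiguous bases are illustrative assumptions rather than the procedure used in Chapter 3.

```python
from statistics import mean


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_difference(ht_seqs, vertical_seqs):
    """Mean GC of HT genes minus mean GC of vertical genes in one genome.

    A negative value indicates that the HT genes are, on average, more AT-rich.
    """
    return mean(map(gc_content, ht_seqs)) - mean(map(gc_content, vertical_seqs))
```

Computing the difference per genome, rather than pooling genes across genomes, keeps the comparison independent of the variation in mean GC content between GPA genomes.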
The function and nucleotide-based observations showed that my HT and vertical gene sets conform to prior expectations, and that choice of reconciliation parameters appears to be qualitatively irrelevant, at least within the range tested here.

Next, I looked at the temporal patterns of gene flow into GPA. Making sure that the results are not biased by reference tree branch lengths during reconciliation, I found a striking correlation between GPA branch lengths (a proxy for evolutionary time) and the number of transfers at that branch. This observation supported a continuous model of gene acquisition over GPA evolution, though the contribution of small bursts of gene gain cannot be ruled out. Interestingly, I found that recent HGT events did not correlate with branch length. If gene acquisition is a stochastic process, this stochasticity may be more evident at shorter timescales. Only consideration over longer periods of evolution can then reveal the largely continuous nature of gene gain. Finally, I briefly looked at the most common likely donors of HT genes into GPA. I found that whilst the majority of all transfers were predicted to originate from closely related taxa within the Bacillaceae, phylogenetically distant acquisitions were from groups that had similar growth characteristics and environments - underlining the importance of shared ecology in determining gene flow.

A number of improvements could be made to augment the analyses in Chapter 3. First, while functional enrichment patterns for HGTs have classically utilised the quite broad COG system of classification, which appeared sufficient for the purpose of qualifying the dataset, it would be interesting to see a more fine-grained breakdown of functional preference in GPA gene gain. The EggNOG framework I used to assign function does provide functional sub-classification into more discrete groups. However, the challenge is to connect genes of certain function in a biologically logical context, e.g. in pathways. For this purpose, I could have applied the Kyoto Encyclopedia of Genes and Genomes (KEGG) system (Kanehisa et al. 2017; Ogata et al. 1999), or similar systems of classification, to link HT gene gain to pathway evolution. Despite the compelling strength of the correlation between gene gain and branch length, I could extend the temporal analysis to include transfers from closely related donors (sHGTs).
Based on the results presented, my a priori expectation would be that sHGTs similarly show an evolutionary history dominated by continuous rather than punctuated gene gain. Further, without delving into Bayesian approaches to phylogenetic reconstruction, I could improve the GPA reference tree I used for the temporal analysis. Instead of using a pruned subtree of the broader Bacillaceae phylogeny, I could reconstruct the GPA tree using 1-to-1 orthologues derived from the vertical gene set identified as part of the reconciliation. This should increase the number of informative sites available for phylogenetic calculation, since there would be more genes included. However, I suspect that the relative branch lengths will not change significantly from the phylogeny used here. Finally, as alluded to in Chapter 3, the donor analysis is subject to distinct limitations. Without explicitly testing the accuracy of the reference tree outside of the Bacillaceae, the sequence of events outside of those predicted to arrive into GPA could be poorly inferred, which may lead to spurious donor prediction. I attempted to mitigate this to some extent by identifying donors at broad taxonomic levels (Families, Classes). However, a gene-by-gene analysis which considers the co-transfer of genes would reveal more specific donor-recipient partnerships. This would be particularly interesting in the context of the results presented in Chapter 4 - have certain donors been preferred as sources of metabolic functionality at the terminus, or spore-related functions near the origin of replication?

Despite the opportunity for further analyses, the core goals of Chapter 3 were met by qualifying the HT and vertical gene patterns, as well as touching on the timing of gene gain and sources of genetic diversity. With improved confidence in the fidelity of differentiation between HT and vertical genes, I could undertake further exploratory analysis of spatial patterning in Chapter 4.
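The relationship between branch length and the number of transfers per branch, summarised in this section, could be quantified along the following lines. This is a hedged sketch only: the thesis text here does not state which correlation statistic was used, so the Pearson coefficient and the function name are assumptions.

```python
from math import sqrt


def pearson_r(branch_lengths, transfer_counts):
    """Pearson correlation between per-branch lengths and transfer counts.

    Both arguments are equal-length sequences with one value per branch.
    Raises ZeroDivisionError if either input has zero variance.
    """
    n = len(branch_lengths)
    mean_x = sum(branch_lengths) / n
    mean_y = sum(transfer_counts) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(branch_lengths, transfer_counts))
    var_x = sum((x - mean_x) ** 2 for x in branch_lengths)
    var_y = sum((y - mean_y) ** 2 for y in transfer_counts)
    return cov / sqrt(var_x * var_y)
```

Applying the same function separately to recent transfers and to sHGTs would allow the extensions proposed above to be compared on a common footing.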

5.3 Topology of HT genes in GPA genomes

In the final analysis chapter, I aimed to use the HT and vertical gene sets, predicted and evaluated by me in prior analyses, to discern any bias in their distribution along the GPA chromosome. A recent aggregate analysis suggested that the distribution of HT genes may be more complex than simple sequestration at the

terminus, where they would be less disruptive (but not disruption-free, Hendrickson et al. 2018) to replication-associated processes (Oliveira et al. 2017). In light of this, having established that there is a distinct topology of HT genes in the GPA genome, I went on to investigate the possible causes of this adaptive genome architecture.

My first observation was that HT and vertical genes indeed occupied distinct spatial domains across the GPA chromosome. Specifically, HT genes were enriched in a broad domain around the terminus, but also in two symmetrical zones near the origin of replication. Vertical genes, on the other hand, dominated the region at the immediate origin but were mostly relegated to two extensive, symmetrical flanks on both replichores that were distal to the origin. The role of this spatial bias in adaptive genome architecture was further supported by observations that both recently gained genes and HGTs from closely related taxa adhered to the pattern. Next, I found that the striking symmetry was in fact the result of extensive inversions around the origin-terminus axis over GPA lineage evolution, in line with previous observations of genome rearrangement in the Bacillaceae (Repar and Warnecke 2017). At the same time, I confirmed that the aggregate GPA pattern was in fact recovered in the individual underlying genomes. Thus, really there were two zones of HGT-enrichment, one at the terminus and one proximal to the origin. Both these zones encompassed genomic islands, regions associated with high gene turnover and short-term adaptation, but were not delineated by them. Instead, GIs inhabited these broader HGT-permissive zones alongside more anciently acquired genes. Curious as to what determines HT gene inclusion in one or the other zone, I first looked at whether the zones were equally accepting of incoming genes.
I found that in both zones, genes integrated on the leading strand with approximately equal frequency, though this was less strict than for the vertical zones, suggesting local expression architecture was not differentially prohibitive to integration between the HGT-rich zones. Further, local nucleotide composition did not bias gene acquisition - though HT genes did ameliorate to match

it over longer evolutionary timescales. In fact, it was gene product function that seemed to differentiate between the origin-proximal and terminus HGT-permissive zones. HT genes encoding products involved in metabolism were primarily positioned near the terminus. This bias matched the finding that terminus-located genes experienced elevated expression in metabolically challenging conditions in B. subtilis, which appeared to share the same genomic zone pattern. Functional enrichment in the origin-proximal zone was less straightforward, but finding many genes involved in membrane and cell wall biogenesis, I suspected a link to a developmental program. Indeed, the chromosomal region entrapped in the nascent forespore within the early stages of the sporulation program closely matches the boundary of the origin-proximal HGT-rich zone. I hypothesised that HGT-driven innovation of spore formation, an essential facet of GPA survival in the wild, might need to occur in the limited real estate of the chromosome that is trapped within the early forespore. To test this, I looked at the orthologue distribution of genes located in the first third of the GPA chromosome in both spore-forming and non-spore-forming species. I found that in spore-forming species, orthologues were likewise skewed towards the origin of replication while in non-spore-forming species there was no such pattern. This tentatively suggests that it might be innovation of sporulation that necessitates the existence of an HGT-permissive zone near the origin, which could then also harbour other non-spore related functions.

The most important caveat in the results presented in Chapter 4 is that, despite enrichment in membrane biogenesis-associated genes in the Near Origin zone, and similar patterns of these genes in other spore-forming, but not non-spore-forming genomes, I cannot definitively implicate sporulation as the cause of the observed HT gene pattern.
Ideally, I would have addressed this within my work if more time was available. In Section 5.5, below, I outline orthogonal approaches to investigate the persistence of this HT gene topology across spore-forming and non-spore-forming groups, and to dissect the importance of the Near Origin zone in

harbouring sporulation-associated HT genes in GPA.

Another limitation of this chapter is not considering GPA genes in the context of operons. The transfer of genes in operonic blocks is expected: some genes only provide functional benefit when expressed in combination with their partners (Rocha 2008). This has been supported by observations of spatial and metabolic clustering in HT genes (Dilthey and Lercher 2015). When operonic genes are highly interdependent, their inferred evolutionary history is more likely to be the same, since they would be gained, lost, and duplicated as a unit. So, I could attempt to define co-transferred operons by grouping adjacent genes sharing prediction characteristics, such as branch of gain and type of transfer (e.g. lHGT). However, anecdotal evidence from my dataset has shown neighbouring genes, likely part of operonic units, are often predicted to have divergent histories. In part this is due to the high stringency requirements in HGT inference, ensuring the purest HT gene set possible. Alternatively, there will be cases where a gene might optionally augment operonic function, and so be lost and gained independently of neighbouring genes, resulting in differing gene trees. Nevertheless, defining operons that were acquired at different chromosomal locations, at different points during lineage evolution would allow me to refine what functionality was gained, in both spatial and temporal contexts, rather than relying on the broad strokes of functional enrichment afforded by the COG-based approach. To achieve this, I would perform operon identification with a composition-based approach, e.g. the recently published “Operon-mapper” (Taboada et al. 2018). I would combine this with GPA expression data, to reveal co-transcribed units (at least under certain growth conditions); in addition, this would also augment the expression analysis that currently uses B.
subtilis data (Chapter 4, Section 4.3.5). Operons could then be designated by the dominant evolutionary history ascribed to the included genes, hopefully revealing a more resolved interplay between function, genome architecture, and GPA lineage evolution.

Overall, the work in this chapter reveals a previously uncharacterised pattern of differential HT gene distribution in genomes of the GPA clade, and likely other related Bacillaceae. Determination that HT genes are topologically constrained along the chromosome by their function and the link between gene location and the sporulation program are exciting additions to our knowledge of the complex dynamics governing genetic innovation by HGT in prokaryotic evolution. The results presented here would ideally be supported by further experiments to more precisely investigate the importance of the Near Origin zone for the sporulation program, outlined at the end of this chapter. These findings could also be practically leveraged as pointers for genome engineering, especially as members of the GPA clade are emerging as popular chassis choices for industrial applications.
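The idea raised earlier in this section, of defining candidate co-transferred blocks by grouping adjacent genes that share prediction characteristics, can be sketched in Python as follows. The tuple layout, the function name and the minimum block size are hypothetical, and the sketch deliberately ignores strand and intergenic distance, which a real operon definition would need to consider.

```python
from itertools import groupby


def candidate_blocks(genes, min_size=2):
    """Group adjacent genes sharing branch of gain and transfer type.

    genes: list of (gene_id, branch_of_gain, transfer_type) tuples in
    chromosomal order (hypothetical layout).
    Returns a list of gene-id lists, one per run of at least `min_size` genes.
    """
    blocks = []
    # groupby only merges consecutive items, so runs respect gene adjacency.
    for _, run in groupby(genes, key=lambda g: (g[1], g[2])):
        ids = [g[0] for g in run]
        if len(ids) >= min_size:
            blocks.append(ids)
    return blocks
```

Runs of vertical genes are grouped in the same way unless they are filtered out beforehand. Because neighbouring genes are often predicted to have divergent histories, as noted above, such exact-match grouping would likely under-call operons and is better treated as a starting point than as a definition.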

5.4 Evolutionary lessons for synthetic biology

As discussed in Chapter 1, Section 1.3, the genus Geobacillus has been one target of the recent drive to diversify available chassis for genetic engineering. In this context, more tools are becoming available for both plasmid-based augmentation (Reeve et al. 2016) and, more importantly in the context of this work, rapid and site-specific engineering of chromosomal genes (Bacon et al. 2017; Cripps et al. 2009; Sheng et al. 2017). Furthermore, development of tailored promoter and ribosome binding site (RBS) genetic toolboxes for Geobacillus can allow highly tunable expression of native and heterologous genes (Pogrebnyakov, Jendresen, and Nielsen 2017; Reeve et al. 2016). Taken together, the stage is set for efficient integration of desired genes into the Geobacillus chromosome, so what key facets of natural HGTs should be considered in future engineering efforts?

Certainly, nucleotide-level factors can be important. It is still not entirely clear why organisms on average acquire more AT-rich DNA, given the ability to instead integrate genes that match the native nucleotide composition. However, my work and previous studies have shown that GC content is not a critical determinant

of performance success. Nevertheless, tools for optimising genes at the nucleotide level, with respect to GC content and native codon usage, are widely used and their development benefits from more widespread analysis of host patterns (Gustafsson et al. 2012; Spencer et al. 2012). While the number of host-specific promoter and RBS toolkits is increasing, and these do make fine-tuning of product quantity possible, global genome architecture should also be considered depending on the level of expression desired. For example, even in optimal media, gene expression is domain-based rather than experiencing a linear decline from origin to terminus (see B. subtilis data in Figure 4.5B). So, if high expression is desired, even with a tunable promoter it is likely that the cassette would fare better in a domain capable of native high expression under the chosen conditions. However, care should be taken in introducing genes into contiguous, highly conserved spans, such as the Flank zone in GPA. Although less obvious than the origin of replication, such areas are almost entirely inaccessible to HGT over evolutionary time, and likely benefit from a highly streamlined, superoperonic architecture (Lathe, Snel, and Bork 2000). If the organism finds it hard to make a novel gene or operon “work” within such a zone, even in the long term, then synthetic introduction of cassettes is likely to have latent negative fitness effects that may not be clear immediately, but could result in compensatory mutations to frustrate expression in the medium term. At the very least, the identification of HGT-permissive zones within the GPA genome, and the tentative conservation of this architecture in B. subtilis (Figure 4.5B, top panel), reveals long stretches of real estate that should be accepting of synthetic constructs.

What of compartment-specific heterologous gene expression, facilitated by the presence of two HGT-accepting zones?
Many recent spore-related bioengineering efforts have been aimed at the use of spores as biosensors (Pan et al. 2014; Wu et al. 2015; see also Chen, Ullah, and Jia 2017), and this involves the display of proteins on the surface of the spore. Although there is evidence for σF-driven contribution to spore coat formation, e.g. by the function of the spoIIQ gene (McKenney and Eichenberger 2012; Wang et al. 2006b), most components destined for spore coat-integration are under the control of mother cell-specific σE and σK regulators (Eichenberger et al. 2003). Thus, the existence of an HGT-permissive zone useful for staging forespore-specific expression may currently be of limited use in the still developing field of spore biotechnology. Nevertheless, as our understanding of the processes driving early spore formation advances, especially in the context of divergent sporulation programs in non-B. subtilis taxa (such as GPA), there may be a need to augment forespore-specific processes. Alternatively, in cases where spore-forming organisms are used as a chassis, but sporulation can be disabled, valuable origin-proximal genomic real estate could be used for engineering. The topological biases moulded by evolution, and revealed in this work, may provide a crucial stepping stone in such future attempts.

5.5 Future directions

5.5.1 HGT domain topology across prokaryotes

One clear extension of the work presented in this thesis is to leverage the clade-targeted HGT inference methods described in Chapter 2 to probe the topology of HT genes in other taxonomic groups, possibly other emerging chassis for synthetic biology. In the first instance, this could be applied to an interesting question raised by the link between HGT topology and endospore formation in Chapter 4: does the sporulation program constrain domain architecture? To test this in an evolutionary context, I propose identifying HT gene distribution in pairs of closely related clades - each pair would comprise a group that can form endospores and a group in which the ability to sporulate has been lost. In this way, I could directly test whether the loss of sporulation has led to changes in genome organisation, preferably in multiple pairwise comparisons to properly interrogate the link between HGT zones and sporulation. The Firmicute phylum is the only place to find such groups. Given the very broad distribution of sporulating organisms across the Firmicutes, the program is considered to be ancestral to the phylum (Angert, Hutchison, and Miller 2014). Yet, there are many examples of sporulation loss; indeed, this applies to all the groups considered as non-spore-formers in my analysis in Chapter 4 (e.g. the order Lactobacillales, the family Listeriaceae).

A good first choice for a pairwise comparison of HT gene topology would be within the Clostridia class, which contains the spore-forming genus Clostridium that is also considered as an alternative bioengineering chassis (Kim et al. 2016). While still within the Firmicutes, Clostridia are highly evolutionarily diverged from the GPA clade, with a recent conservative estimate of a common ancestor at ~2.5 billion years ago (Marin et al. 2017). Yet, the sigma factor cascade involved in sporulation, as well as general morphogenesis, is quite highly conserved with B. subtilis (Pereira et al. 2013; Saujet et al. 2013). Another advantage, as it was for GPA in this work, is that Clostridia mainly undergo symmetric rearrangements (Repar and Warnecke 2017), simplifying pattern analysis. Thus, if sporulation is driving the HGT-rich domain distribution, I would expect to find homologous patterning in Clostridium spp. The analysis would then be repeated for a closely related non-spore-forming group, e.g. genomes of the Eubacterium genus that are non-spore-forming despite their large size and close relation to Clostridium (Galperin 2013; Galperin et al. 2012). Results would either implicate sporulation as the guiding force in HGT-domain organisation or, possibly, reveal other underlying forces constraining genome topology.

As a further extension, I would propose to similarly investigate HT gene topology in a far-removed prokaryotic group, within the Archaea. HGT has been implicated as a driving factor in archaeal gene innovation and origin of new groups (Nelson-Sathi et al. 2012, 2015), and the authors suggest that this gene acquisition is asymmetric, with bacterial donors providing material to archaea much more frequently than vice versa. Interestingly, this observation has been challenged in the context of thermophilic archaea where archaea-to-bacteria HGT was found to be much more frequent (Fuchsman et al. 2017). Whichever is the case, archaea would certainly have to deal with some proportion of bacterial genes and integration from such divergent donors may be subject to elevated barriers at every level. It would be interesting to see to what extent HGT topology is shaped by genome organisation in an archaeal group, and such an analysis is becoming more tractable with increasing availability of completely assembled genomes.
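
A minimal sketch of how the HT gene distribution within one clade could be summarised, ahead of the pairwise comparisons proposed above, is given below. It is illustrative only and is not the pipeline used in this thesis: the function names, the input format (a gene position in base pairs paired with an HT flag) and the assumption that the terminus lies diametrically opposite the origin are all simplifications introduced here.

```python
from collections import Counter


def origin_distance(position, origin, genome_length):
    """Relative distance from the origin on a circular chromosome.

    Returns 0.0 at the origin and 1.0 at the point opposite it
    (assumed here, for simplicity, to be the terminus).
    """
    offset = abs(position - origin) % genome_length
    shortest = min(offset, genome_length - offset)
    return shortest / (genome_length / 2)


def ht_fraction_by_bin(genes, origin, genome_length, n_bins=10):
    """Fraction of HT genes in each origin-to-terminus bin.

    genes: iterable of (position_bp, is_ht) pairs (hypothetical input format).
    Bins with no genes are reported as None.
    """
    totals, ht = Counter(), Counter()
    for position, is_ht in genes:
        rel = origin_distance(position, origin, genome_length)
        # A gene exactly opposite the origin (rel == 1.0) goes in the last bin.
        b = min(int(rel * n_bins), n_bins - 1)
        totals[b] += 1
        ht[b] += int(is_ht)
    return [ht[b] / totals[b] if totals[b] else None for b in range(n_bins)]
```

Profiles computed in this way for a spore-forming group and its non-sporulating relative could then be compared bin by bin; the number of bins is arbitrary, and a formal comparison would require an appropriate statistical test.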

5.5.2 Spore-specific genetic innovation in GPA clade

There is also potential to further investigate the role of the origin-proximal domain as a harbour for innovation of the early sporulation program in GPA. When comparing the suite of genes under the control of the σF regulator in B. subtilis and G. kaustophilus, I found that only 31 of the 48 B. subtilis genes were conserved in all GPA genomes - indicating their importance in the core sporulation program. The loss of 17 genes in some, and in 5 cases all, GPA genomes suggests room for gene turnover even at this stage of the program and so I would expect to find at least a few candidate HT genes gained during GPA evolution that might be under the control of σF. If my model of HT gene topology being driven by compartment-specific expression is valid, I should find these candidate genes within the Near Origin HGT-permissive zone.

I would identify putative σF regulons by searching for σF promoter sequences. Given the high conservation of sporulation regulation amongst spore-formers, I would use the consensus binding sequence from B. subtilis (Sierro et al. 2008). Correct identification of putative promoters could be evaluated by detecting downstream genes that are known GPA orthologues of B. subtilis σF-activated genes (i.e. the 31 genes mentioned above). I would then select candidates representing novel σF-regulated genes that are predicted to have been acquired via HGT, according to my HGT inference methods. At this stage, the chromosomal location of these candidates should give the first indication of the importance of the Near Origin zone for augmenting early spore formation in GPA. To confirm the role of candidates in the first stages of spore formation, compartment-specific expression could be determined by performing targeted RNA fluorescence in situ hybridization (FISH) on a model GPA species, such as G. kaustophilus, during spore formation.

This analysis could be further extended by determining HGT acquisitions contributing to spore formation at different stages and with different functions, e.g. innovation of the spore coat by expressing HT genes under the control of the late mother cell-specific factor σK. Comprehensive dissection of gene flow at various stages of sporulation should further elucidate spatial biases associated with HT genes. Further, looking at what functions were gained, and when, could help build a timeline for the augmentation of spore formation in GPA. Such work may provide the first clues to the extraordinary survival of these biological ‘lifeboats’ in the most varied and adverse environments on the planet.
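
The promoter search outlined above could be prototyped along the following lines. This is a sketch under stated assumptions rather than a finished method: the function names are hypothetical, only the forward strand is scanned, and the two hexamers in the toy example are the familiar σA-type -35/-10 consensus, used purely as stand-ins for the σF consensus that would be taken from Sierro et al. (2008).

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_regex(motif):
    """Translate an IUPAC motif into a regular-expression fragment."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_promoter_sites(sequence, box35, box10, spacer=(14, 16)):
    """Find box35-spacer-box10 sites on the forward strand.

    Returns (start, end) pairs (0-based, end exclusive). A lookahead is used
    so that overlapping candidate sites are all reported.
    """
    lo, hi = spacer
    core = motif_regex(box35) + "[ACGT]{%d,%d}" % (lo, hi) + motif_regex(box10)
    pattern = re.compile("(?=(" + core + "))")
    return [(m.start(), m.start() + len(m.group(1)))
            for m in pattern.finditer(sequence.upper())]


def downstream_genes(site_end, gene_starts, max_gap=300):
    """Names of forward-strand genes starting within max_gap bp of a site.

    gene_starts: dict of gene name -> start coordinate (hypothetical format).
    max_gap is an arbitrary illustrative cut-off.
    """
    return [name for name, start in gene_starts.items()
            if 0 <= start - site_end <= max_gap]


# Toy example with placeholder motifs (NOT the sigma-F consensus).
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGG"
sites = find_promoter_sites(toy, "TTGACA", "TATAAT", spacer=(16, 18))
```

Predicted sites whose downstream genes include the conserved σF-dependent orthologues would serve as the positive control, and the remaining sites upstream of predicted HT genes would form the candidate list; a full implementation would also need to scan the reverse strand and tolerate mismatches to the consensus.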

References

Adams, Bryn L. (2016). “The Next Generation of Synthetic Biology Chassis: Moving Synthetic Biology from the Laboratory to the Field”. In: ACS Synthetic Biology 5.12, pp. 1328–1330.
Agostinelli, C. and U. Lund (2017). R package circular: Circular Statistics (version 0.4-93). CA: Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University, Venice, Italy. UL: Department of Statistics, California Polytechnic State University, San Luis Obispo, California, USA.
Aliyu, Habibu et al. (2016). “Phylogenomic re-assessment of the thermophilic genus Geobacillus”. In: Systematic and Applied Microbiology 39.8, pp. 527–533.
Alkhalili, Rawana N. et al. (2016). “Antimicrobial protein candidates from the thermophilic Geobacillus sp. Strain ZGt-1: Production, proteomics, and bioinformatics analysis”. In: International Journal of Molecular Sciences 17.8.
Altschul, Stephen F. et al. (1990). “Basic local alignment search tool”. In: Journal of Molecular Biology 215.3, pp. 403–410.
Amorós-Moya, Dolors et al. (2010). “Evolution in regulatory regions rapidly compensates the cost of nonoptimal codon usage”. In: Molecular Biology and Evolution 27.9, pp. 2141–2151.
Angert, Esther R., Elizabeth A. Hutchison, and David A. Miller (2014). “Sporulation in Bacteria: Beyond the Standard Model”. In: The Bacterial Spore: from Molecules to Systems, pp. 87–102.

Asano, Yasuhisa and Yasuo Kato (1998). “Z-phenylacetaldoxime degradation by a novel aldoxime dehydratase from Bacillus sp. strain OxB-1”. In: FEMS Microbiology Letters 158.2, pp. 185–190.
Bacon, Leann F. et al. (2017). “Development of an efficient technique for gene deletion and allelic exchange in Geobacillus spp.” In: Microbial Cell Factories 16.1, p. 58.
Bansal, Mukul S., J. Peter Gogarten, and Ron Shamir (2010). “Detecting Highways of Horizontal Gene Transfer”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 6398 LNBI. Springer, Berlin, Heidelberg, pp. 109–120.
Barák, Imrich and Katarína Muchová (2018). “The positioning of the asymmetric septum during sporulation in .” In: PloS one 13.8. Ed. by Eric Cascales.
Barlow, Miriam and Barry G. Hall (2002). “Phylogenetic analysis shows that the OXA β-lactamase genes have been on plasmids for millions of years”. In: Journal of Molecular Evolution 55.3, pp. 314–321.
Beiko, R. G., T. J. Harlow, and M. A. Ragan (2005). “Highways of gene sharing in prokaryotes”. In: Proceedings of the National Academy of Sciences 102.40, pp. 14332–14337.
Belduz, Ali Osman, Sabriye Dulger, and Zihni Demirbag (2003). “Anoxybacillus gonensis sp. nov., a moderately thermophilic, xylose-utilizing, endospore-forming bacterium”. In: International Journal of Systematic and Evolutionary Microbiology 53.5, pp. 1315–1320.
Ben-Yehuda, Sigal et al. (2005). “Defining a centromere-like element in Bacillus subtilis by identifying the binding sites for the chromosome-anchoring protein RacA”. In: Molecular Cell 17.6, pp. 773–782.
Berne, Cécile et al. (2015). “Adhesins Involved in Attachment to Abiotic Surfaces by Gram-Negative Bacteria”. In: Microbiology Spectrum 3.4.

Bertelli, Claire et al. (2017). “IslandViewer 4: Expanded prediction of genomic islands for larger-scale datasets”. In: Nucleic Acids Research 45.W1, W30–W35.
Bezuidt, Oliver K. et al. (2016). “The Geobacillus pan-genome: Implications for the evolution of the genus”. In: Frontiers in Microbiology 7.05.
Borkowski, Olivier et al. (2016). “Translation elicits a growth rate-dependent, genome-wide, differential protein production in Bacillus subtilis”. In: Molecular Systems Biology 12.5, p. 870.
Bosma, Elleke F. et al. (2016). “Complete genome sequence of thermophilic Bacillus smithii type strain DSM 4216T”. In: Standards in Genomic Sciences 11.1.
Boucher, Yan et al. (2003). “Lateral Gene Transfer and the Origins of Prokaryotic Groups”. In: Annual Review of Genetics 37.1, pp. 283–328.
Briggs, Geoffrey S., Wiep Klaas Smits, and Panos Soultanas (2012). “Chromosomal replication initiation machinery of low-G+C-content firmicutes”. In: Journal of Bacteriology 194.19, pp. 5162–5170.
Brumm, Phillip et al. (2015a). “Complete Genome Sequence of Geobacillus strain Y4.1MC1, a Novel CO-Utilizing Geobacillus thermoglucosidasius Strain Isolated from Bath Hot Spring in Yellowstone National Park”. In: Bioenergy Research 8.3, pp. 1039–1045.
Brumm, Phillip et al. (2015b). “Complete genome sequences of Geobacillus sp. Y412MC52, a xylan-degrading strain isolated from obsidian hot spring in Yellowstone National Park”. In: Standards in Genomic Sciences 10.1.
Brumm, Phillip J., Miriam L. Land, and David A. Mead (2016). “Complete genome sequences of Geobacillus sp. WCH70, a thermophilic strain isolated from wood compost”. In: Standards in Genomic Sciences 11.1.
Brumm, Phillip J. et al. (2015c). “Genomic analysis of six new Geobacillus strains reveals highly conserved carbohydrate degradation architectures and strategies”. In: Frontiers in Microbiology 6.05, p. 430.

Bryant, Jack A. et al. (2014). “Chromosome position effects on gene expression in Escherichia coli K-12”. In: Nucleic Acids Research 42.18, pp. 11383–11392.
Casacuberta, Elena and Josefa González (2013). “The impact of transposable elements in environmental adaptation”. In: Molecular Ecology 22.6, pp. 1503–1517.
Chan, Leon Y, Sriram Kosuri, and Drew Endy (2005). “Refactoring bacteriophage T7”. In: Molecular Systems Biology 1.1, pp. 1–10.
Chen, Huayou, Jawad Ullah, and Jinru Jia (2017). “Progress in Bacillus subtilis Spore Surface Display Technology towards Environment, Vaccine Development, and Biocatalysis”. In: Journal of Molecular Microbiology and Biotechnology 27.3, pp. 159–167.
Choulet, Frédéric et al. (2006). “Evolution of the terminal regions of the Streptomyces linear chromosome”. In: Molecular Biology and Evolution 23.12, pp. 2361–2369.
Cohen, Ofir, Uri Gophna, and Tal Pupko (2011). “The complexity hypothesis revisited: Connectivity Rather Than function constitutes a barrier to horizontal gene transfer”. In: Molecular Biology and Evolution 28.4, pp. 1481–1489.
Cooper, Vaughn S. et al. (2010). “Why genes evolve faster on secondary chromosomes in bacteria”. In: PLoS Computational Biology 6.4, p. 1000732.
Coorevits, An et al. (2012). “Taxonomic revision of the genus Geobacillus: Emendation of Geobacillus, G. stearothermophilus, G. jurassicus, G. toebii, G. thermodenitrificans and G. thermoglucosidans (nom. corrig., formerly ’thermoglucosidasius’); transfer of Bacillus thermantarcticus ”. In: International Journal of Systematic and Evolutionary Microbiology 62.7, pp. 1470–1485.
Couturier, Etienne and Eduardo P C Rocha (2006). “Replication-associated gene dosage effects shape the genomes of fast-growing bacteria but only for transcription and translation genes”. In: Molecular Microbiology 59.5, pp. 1506–1518.

Cripps, R. E. et al. (2009). “Metabolic engineering of Geobacillus thermoglucosidasius for high yield ethanol production”. In: Metabolic Engineering 11.6, pp. 398–408.
Danchin, Antoine (2012). “Scaling up synthetic biology: Do not forget the chassis”. In: FEBS Letters 586.15, pp. 2129–2137.
Darmon, E. and D. R. F. Leach (2014). “Bacterial Genome Instability”. In: Microbiology and Molecular Biology Reviews 78.1, pp. 1–39.
Daubin, Vincent, Emmanuelle Lerat, and Guy Perrière (2003). “The source of laterally transferred genes in bacterial genomes.” In: Genome biology 4.9, R57.
Daubin, Vincent and Howard Ochman (2004). “Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli”. In: Genome Research 14.6, pp. 1036–1042.
David, Lawrence A. and Eric J. Alm (2011). “Rapid evolutionary innovation during an Archaean genetic expansion”. In: Nature 469.7328, pp. 93–96.
Davids, Wagied and Zhaolei Zhang (2008). “The impact of horizontal gene transfer in shaping operons and protein interaction networks - Direct evidence of preferential attachment”. In: BMC Evolutionary Biology 8.1.
Davidson, Ruth et al. (2015). “Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer”. In: BMC Genomics 16.Suppl 10, pp. 1–12.
De Hoon, Michiel J.L., Patrick Eichenberger, and Dennis Vitkup (2010). “Hierarchical evolution of the bacterial sporulation network”. In: Current Biology 20.17.
de Vries, Andrie and Brian D. Ripley (2016). ggdendro: Create Dendrograms and Tree Diagrams Using ’ggplot2’. R package version 0.1-20.
DeMaere, M. Z. et al. (2013). “High level of intergenera gene exchange shapes the evolution of haloarchaea in an isolated Antarctic lake”. In: Proceedings of the National Academy of Sciences 110.42, pp. 16939–16944.

Deschamps, Philippe et al. (2014). “Pangenome evidence for extensive interdomain horizontal transfer affecting lineage core and shell genes in uncultured planktonic thaumarchaeota and euryarchaeota”. In: Genome Biology and Evolution 6.7, pp. 1549–1563.
Dilthey, Alexander and Martin J. Lercher (2015). “Horizontally transferred genes cluster spatially and metabolically”. In: Biology Direct 10.1, p. 72.
Dobrindt, Ulrich et al. (2004). “Genomic islands in pathogenic and environmental microorganisms”. In: Nature Reviews Microbiology 2.5, pp. 414–424.
Dorman, Charles J (2009). “Regulatory integration of horizontally-transferred genes in bacteria.” In: Frontiers in bioscience (Landmark edition) 14.7, pp. 4103–12.
Doyon, Jean-Philippe et al. (2010). “An Efficient Algorithm for Gene/Species Trees Parsimonious Reconciliation with Losses, Duplications and Transfers”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 6398 LNBI. Springer, Berlin, Heidelberg, pp. 93–108.
Dunlap, Christopher A. et al. (2015). “Bacillus paralicheniformis sp. Nov., isolated from fermented soybean paste”. In: International Journal of Systematic and Evolutionary Microbiology 65.10, pp. 3487–3492.
Edgar, Robert C. (2004). “MUSCLE: Multiple sequence alignment with high accuracy and high throughput”. In: Nucleic Acids Research 32.5, pp. 1792–1797.
Efron, B., E. Halloran, and S. Holmes (1996). “Bootstrap confidence levels for phylogenetic trees”. In: Proceedings of the National Academy of Sciences 93.23, pp. 13429–13429.
Eichenberger, Patrick et al. (2003). “The σE regulon and the identification of additional sporulation genes in Bacillus subtilis”. In: Journal of Molecular Biology 327.5, pp. 945–972.

Ellis, Michael J. and David B. Haniford (2016). “Riboregulation of bacterial and archaeal transposition”. In: Wiley Interdisciplinary Reviews: RNA 7.3, pp. 382–398.
Enright, A J, S Van Dongen, and C A Ouzounis (2002). “An efficient algorithm for large-scale detection of protein families”. In: Nucleic Acids Research 30.7, pp. 1575–1584.
Felsenstein, Joseph (1985). “Confidence Limits on Phylogenies: An Approach Using the Bootstrap”. In: Evolution 39.4, p. 783.
Ferrándiz, María-José et al. (2014). “Role of Global and Local Topology in the Regulation of Gene Expression in Streptococcus pneumoniae”. In: PLoS ONE 9.7, e101574.
Filippidou, Sevasti et al. (2016). “Anoxybacillus geothermalis sp. nov., a facultatively anaerobic, endospore-forming bacterium isolated from mineral deposits in a geothermal station”. In: International Journal of Systematic and Evolutionary Microbiology 66.8, pp. 2944–2951.
Frandsen, Niels et al. (1999). “Transient gene asymmetry during sporulation and establishment of cell specificity in Bacillus subtilis”. In: Genes and Development 13.4, pp. 394–399.
Friehs, Karl (2004). “Plasmid copy number and plasmid stability.” In: Advances in biochemical engineering/biotechnology 86, pp. 47–82.
Frost, Laura S. et al. (2005). “Mobile genetic elements: The agents of open source evolution”. In: Nature Reviews Microbiology 3.9, pp. 722–732.
Fuchsman, Clara A. et al. (2017). “Effect of the environment on horizontal gene transfer between bacteria and archaea”. In: PeerJ 5, e3865.
Galperin, Michael Y. (2013). “Genome Diversity of Spore-Forming Firmicutes”. In: Microbiology Spectrum 1.2.

Galperin, Michael Y. et al. (2012). “Genomic determinants of sporulation in Bacilli and Clostridia: Towards the minimal set of sporulation-specific genes”. In: Environmental Microbiology 14.11, pp. 2870–2890.
Galtier, Nicolas and Vincent Daubin (2008). “Dealing with incongruence in phylogenomic analyses”. In: Philosophical Transactions of the Royal Society B: Biological Sciences 363.1512, pp. 4023–4029.
Gao, F., H. Luo, and C.-T. Zhang (2013). “DoriC 5.0: an updated database of oriC regions in both bacterial and archaeal genomes”. In: Nucleic Acids Research 41.D1, pp. D90–D93.
Gogarten, J. Peter, W. Ford Doolittle, and Jeffrey G. Lawrence (2002). “Prokaryotic evolution in light of gene transfer”. In: Molecular Biology and Evolution 19.12, pp. 2226–2238.
Gogarten, J. Peter and Jeffrey P. Townsend (2005). “Horizontal gene transfer, genome innovation and evolution”. In: Nature Reviews Microbiology 3.9, pp. 679–687.
Goodman, M. et al. (1979). “Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences”. In: Systematic Biology 28.2, pp. 132–163.
Groussin, Mathieu et al. (2016). “Gene acquisitions from bacteria at the origins of major archaeal clades are vastly overestimated”. In: Molecular Biology and Evolution 33.2, pp. 305–310.
Gu, Pengfei et al. (2015). “A rapid and reliable strategy for chromosomal integration of gene(s) with multiple copies”. In: Scientific Reports 5.1, pp. 1–9.
Gustafsson, Claes et al. (2012). “Engineering genes for predictable protein expression”. In: Protein Expression and Purification 83.1, pp. 37–46.
Hackathon, R. (2017). phylobase: Base Package for Phylogenetic Structures and Comparative Data. R package version 0.8.4.

Hao, Weilong and GB Golding (2006). “The fate of laterally transferred genes: Life in the fast lane to adaptation or death”. In: Genome Research 16.5, pp. 636–643.
Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. (2018). Hmisc: Harrell Miscellaneous. R package version 4.1-1.
Hasegawa, Masami and Hirohisa Kishino (1989). “Confidence Limits On The Maximum-Likelihood Estimate Of The Hominoid Tree From Mitochondrial-Dna Sequences”. In: Evolution 43.3, pp. 672–677.
Hayashi, Tetsuya (2001). “Complete Genome Sequence of Enterohemorrhagic Escherichia coli O157:H7 and Genomic Comparison with a Laboratory Strain K-12”. In: DNA Research 8.1, pp. 11–22.
Hehemann, Jan Hendrik et al. (2016). “Adaptive radiation by waves of gene transfer leads to fine-scale resource partitioning in marine microbes”. In: Nature Communications 7, pp. 1–10.
Hendrickson, Heather and Jeffrey G. Lawrence (2006). “Selection for chromosome architecture in bacteria”. In: Journal of Molecular Evolution 62.5, pp. 615–629.
Hendrickson, Heather L. et al. (2018). “Chromosome architecture constrains horizontal gene transfer in bacteria”. In: PLoS Genetics 14.5.
Hershberg, Ruth and Dmitri A. Petrov (2010). “Evidence that mutation is universally biased towards AT in bacteria”. In: PLoS Genetics 6.9, pp. 1–13.
Hilbert, D. W. and P. J. Piggot (2004). “Compartmentalization of Gene Expression during Bacillus subtilis Spore Formation”. In: Microbiology and Molecular Biology Reviews 68.2, pp. 234–262.
Hildebrand, Falk, Axel Meyer, and Adam Eyre-Walker (2010). “Evidence of selection upon genomic GC-content in bacteria”. In: PLoS Genetics 6.9. Ed. by Michael W. Nachman, e1001107.
Hobbs, Joanne K. et al. (2012). “On the origin and evolution of thermophily: Reconstruction of functional precambrian enzymes from ancestors of Bacillus”. In: Molecular Biology and Evolution 29.2, pp. 825–835.

Hooper, Sean D and Otto G Berg (2003). “Duplication is more common among laterally transferred genes than among indigenous genes.” In: Genome biology 4.8, R48.
Huerta-Cepas, Jaime et al. (2016). “EGGNOG 4.5: A hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences”. In: Nucleic Acids Research 44.D1, pp. D286–D293.
Imperial College Research Computing Service (2018). 10.14469/hpc/2232.
Itoh, Takeshi et al. (1999). “Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes”. In: Molecular Biology and Evolution 16.3, pp. 332–346.
Jain, R., M. C. Rivera, and J. A. Lake (1999). “Horizontal gene transfer among genomes: The complexity hypothesis”. In: Proceedings of the National Academy of Sciences 96.7, pp. 3801–3806.
Johnson, Angus and Adrian Baddeley (2017). polyclip: Polygon Clipping. R package version 1.6-1.
Juhas, Mario et al. (2009). “Genomic islands: Tools of bacterial horizontal gene transfer and evolution”. In: FEMS Microbiology Reviews 33.2, pp. 376–393.
Kamneva, Olga K. and Naomi L. Ward (2014). “Reconciliation approaches to determining HGT, duplications, and losses in gene trees”. In: Methods in Microbiology 41, pp. 183–199.
Kanehisa, Minoru et al. (2017). “KEGG: New perspectives on genomes, pathways, diseases and drugs”. In: Nucleic Acids Research 45.D1, pp. D353–D361.
Karlin, S and C Burge (1995). “Dinucleotide relative abundance extremes: a genomic signature”. In: Trends in Genetics 11.7, pp. 283–290.
Karlin, Samuel (2001). “Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes”. In: Trends in Microbiology 9.7, pp. 335–343.
Kassambara, Alboukadel (2017). ggpubr: ’ggplot2’ Based Publication Ready Plots. R package version 0.1.6.

Kato, Jun Ichi and Masayuki Hashimoto (2007). “Construction of consecutive deletions of the Escherichia coli chromosome”. In: Molecular Systems Biology 3.
Kim, Juhyun et al. (2016). “Properties of alternative microbial hosts used in synthetic biology: towards the design of a modular chassis”. In: Essays In Biochemistry 60.4, pp. 303–313.
Kim, Junehyung and Wolfgang Schumann (2009). “Display of proteins on bacillus subtilis endospores”. In: Cellular and Molecular Life Sciences 66.19, pp. 3127–3136.
Kim, Soo Jin et al. (2015). “Bacillus glycinifermentans sp. Nov., isolated from fermented soybean paste”. In: International Journal of Systematic and Evolutionary Microbiology 65.10, pp. 3586–3590.
Kishino, Hirohisa and Masami Hasegawa (1989). “Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea”. In: Journal of Molecular Evolution 29.2, pp. 170–179.
Koehler, Theresa M. (2009). “Bacillus anthracis physiology and genetics”. In: Molecular Aspects of Medicine 30.6, pp. 386–396.
Konuray, Gözde and Zerrin Erginkaya (2018). “Potential Use of Bacillus coagulans in the Food Industry”. In: Foods 7.6, p. 92.
Koonin, Eugene V. and Yuri I. Wolf (2012). “Evolution of microbes and viruses: a paradigm shift in evolutionary biology?” In: Frontiers in Cellular and Infection Microbiology 2.
Kudla, Grzegorz et al. (2009). “Coding-sequence determinants of expression in escherichia coli”. In: Science 324.5924, pp. 255–258.
Labroussaa, Fabien et al. (2016). “Impact of donor-recipient phylogenetic distance on bacterial genome transplantation”. In: Nucleic Acids Research 44.17, pp. 8501–8511.

Lartigue, Carole et al. (2007). “Genome transplantation in bacteria: Changing one species to another”. In: Science 317.5838, pp. 632–638.
Lassalle, Florent et al. (2015). “GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands”. In: PLoS Genetics 11.2. Ed. by Dmitri A. Petrov, pp. 1–20.
Lathe, Warren C., Berend Snel, and Peer Bork (2000). “Gene context conservation of a higher order than operons”. In: Trends in Biochemical Sciences 25.10, pp. 474–479.
Lawrence, J. G. and H. Ochman (1998). “Molecular archaeology of the Escherichia coli genome”. In: Proceedings of the National Academy of Sciences 95.16, pp. 9413–9417.
Lawrence, Jeffrey G. and Howard Ochman (1997). “Amelioration of bacterial genomes: Rates of change and exchange”. In: Journal of Molecular Evolution 44.4, pp. 383–397.
Lawrence, Jeffrey G. and John R. Roth (1996). “Selfish operons: horizontal transfer may drive the evolution of gene clusters.” In: Genetics 143.4, pp. 1843–60.
Le Fourn, Céline et al. (2011). “An oxygen reduction chain in the hyperthermophilic anaerobe Thermotoga maritima highlights horizontal gene transfer between Thermococcales and Thermotogales”. In: Environmental Microbiology 13.8, pp. 2132–2145.
Lercher, Martin J. and Csaba Pál (2008). “Integration of horizontally transferred genes into regulatory interaction networks takes many million years”. In: Molecular Biology and Evolution 25.3, pp. 559–567.
Li, L., Christian J. Stoeckert, and David S. Roos (2003). “OrthoMCL: Identification of ortholog groups for eukaryotic genomes”. In: Genome Research 13.9, pp. 2178–2189.
Lind, Peter A. et al. (2010). “Compensatory gene amplification restores fitness after inter-species gene replacements”. In: Molecular Microbiology 75.5, pp. 1078–1089.

Lockhart, Peter J., David Penny, and Axel Meyer (1995). “Testing the phylogeny of swordtail fishes using split decomposition and spectral analysis”. In: Journal of Molecular Evolution 41.5, pp. 666–674.
Lopez-Garrido, Javier et al. (2018). “Chromosome Translocation Inflates Bacillus Forespores and Impacts Cellular Morphology”. In: Cell 172.4, 758–770.e14.
Lorenzo, Víctor de and Antoine Danchin (2008). “Synthetic biology: Discovering new worlds and new words. The new and not so new aspects of this emerging research field”. In: EMBO Reports 9.9, pp. 822–827.
Losick, Richard and Patrick Stragier (1992). “Crisscross regulation of cell-type-specific gene expression during development in B. subtilis”. In: 355.6361, pp. 601–604.
Marchant, Roger et al. (2002). “The frequency and characteristics of highly thermophilic bacteria in cool soil environments”. In: Environmental Microbiology 4.10, pp. 595–602.
Marin, Julie et al. (2017). “The Timetree of Prokaryotes: New Insights into Their Evolution and Speciation”. In: Molecular biology and evolution 34.2, pp. 437–446.
Mayilraj, Shanmugam and Erko Stackebrandt (2013). “The family Paenibacillaceae”. In: The Prokaryotes: Firmicutes and Tenericutes. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 267–280.
McCarthy, Alex J. et al. (2014). “Extensive horizontal gene transfer during Staphylococcus aureus co-colonization in vivo”. In: Genome Biology and Evolution 6.10, pp. 2697–2708.
McCutcheon, John P. and Nancy A. Moran (2010). “Functional convergence in reduced genomes of bacterial symbionts spanning 200 my of evolution”. In: Genome Biology and Evolution 2.1, pp. 708–718.
McCutcheon, John P. and Nancy A. Moran (2012). “Extreme genome reduction in symbiotic bacteria”. In: Nature Reviews Microbiology 10.1, pp. 13–26.

McKenney, Peter T. and Patrick Eichenberger (2012). “Dynamics of spore coat morphogenesis in Bacillus subtilis”. In: Molecular Microbiology 83.2, pp. 245–260.
McMullan, G. et al. (2004). “Habitat, applications and genomics of the aerobic, thermophilic genus Geobacillus”. In: Biochemical Society Transactions 32.2, pp. 214–217.
Mead, David A. et al. (2012). “Complete genome sequence of Paenibacillus strain Y4.12MC10, a novel Paenibacillus lautus strain isolated from obsidian hot spring in yellowstone national park”. In: Standards in Genomic Sciences 6.3, pp. 366–385.
Médigue, C. et al. (1991). “Evidence for horizontal gene transfer in Escherichia coli speciation”. In: Journal of Molecular Biology 222.4, pp. 851–856.
Medrano-Soto, Arturo et al. (2004). “Successful lateral transfer requires codon usage compatibility between foreign genes and recipient genomes”. In: Molecular Biology and Evolution 21.10, pp. 1884–1894.
Meintanis, Christos et al. (2006). “Biodegradation of crude oil by thermophilic bacteria isolated from a volcano island”. In: Biodegradation 17.2, pp. 105–111.
Merkle, Daniel, Martin Middendorf, and Nicolas Wieseke (2010). “A parameter-adaptive dynamic programming approach for inferring cophylogenies”. In: BMC Bioinformatics 11.Sup.1, S60.
Mira, Alex, Howard Ochman, and Nancy A. Moran (2001). “Deletional bias and the evolution of bacterial genomes”. In: Trends in Genetics 17.10, pp. 589–596.
Mirarab, Siavash and Tandy Warnow (2015). “ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes”. In: Bioinformatics 31.12, pp. 44–52.
Mirkin, Ekaterina V and Sergei M Mirkin (2005). “Mechanisms of transcription-replication collisions in bacteria.” In: Molecular and cellular biology 25.3, pp. 888–895.

Mongodin, E. F. et al. (2005). “The genome of Salinibacter ruber: Convergence and gene exchange among hyperhalophilic bacteria and archaea”. In: Proceedings of the National Academy of Sciences 102.50, pp. 18147–18152.
Moszer, Ivan, Eduardo Pc Rocha, and Antoine Danchin (1999). “Codon usage and lateral gene transfer in Bacillus subtilis”. In: Current Opinion in Microbiology 2.5, pp. 524–528.
Mott, Melissa L. and James M. Berger (2007). “DNA replication initiation: Mechanisms and regulation in bacteria”. In: Nature Reviews Microbiology 5.5, pp. 343–354.
Musto, Héctor et al. (2004). “Correlations between genomic GC levels and optimal growth temperatures in prokaryotes.” In: FEBS Letters 573.1-3, pp. 73–77.
Muto, A. and S. Osawa (1987). “The guanine and cytosine content of genomic DNA and bacterial evolution.” In: Proceedings of the National Academy of Sciences 84.1, pp. 166–169.
Müller, Kirill et al. (2017). RSQLite: ’SQLite’ Interface for R. R package version 2.0.
Nazina, T. N. et al. (2001). “Taxonomic study of aerobic thermophilic bacilli: descriptions of Geobacillus subterraneus gen. nov., sp. nov. and Geobacillus uzenensis sp. nov. from petroleum reservoirs and transfer of Bacillus stearothermophilus, Bacillus thermocatenulatus, Bacillus thermoleovorans, Bacillus kaustophilus, Bacillus thermoglucosidasius and Bacillus thermodenitrificans to Geobacillus as the new combinations G. stearothermophilus, G. thermocatenulatus, G. thermoleovorans, G. kaustophilus, G. thermoglucosidasius and G. thermodenitrificans”. In: International Journal of Systematic and Evolutionary Microbiology 51.2, pp. 433–446.
Nelson, Karen E. et al. (1999). “Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima”. In: Nature 399.6734, pp. 323–329.

Nelson-Sathi, Shijulal et al. (2012). “Acquisition of 1,000 eubacterial genes physiologically transformed a methanogen at the origin of Haloarchaea.” In: Proceedings of the National Academy of Sciences of the United States of America 109.50, pp. 20537–20542.
Nelson-Sathi, Shijulal et al. (2015). “Origins of major archaeal clades correspond to gene acquisitions from bacteria”. In: Nature 517.7532, pp. 77–80.
Nguyen, Thi Hau et al. (2012). “Accounting for gene tree uncertainties improves gene trees and reconciliation inference”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 7534 LNBI. Springer, Berlin, Heidelberg, pp. 123–134.
Nguyen, Thi Hau et al. (2013). “Reconciliation and local gene tree rearrangement can be of mutual profit”. In: Algorithms for Molecular Biology 8.1, p. 12.
Ogata, Hiroyuki et al. (1999). “KEGG: Kyoto Encyclopedia of Genes and Genomes”. In: Nucleic Acids Research 27.1, pp. 29–34.
Ogden, T. Heath and Michael S. Rosenberg (2006). “Multiple sequence alignment accuracy and phylogenetic inference”. In: Systematic Biology 55.2. Ed. by Rod Page, pp. 314–328.
Oliveira, Pedro H. et al. (2017). “The chromosomal organization of horizontal gene transfer in bacteria”. In: Nature Communications 8.1, p. 841.
Oren, Yaara et al. (2014). “Transfer of noncoding DNA drives regulatory rewiring in bacteria”. In: Proceedings of the National Academy of Sciences 111.45, pp. 16112–16117.
Page, R D M (1994). “Maps Between Trees and Cladistic Analysis of Historical Associations among Genes, Organisms, and Areas”. In: Systematic Biology 43.1, pp. 58–77.
Pagès, H. et al. (2017). Biostrings: Efficient manipulation of biological strings. R package version 2.46.0.

Pan, Jae-Gu et al. (2014). “Display of native proteins on Bacillus subtilis spores”. In: FEMS Microbiology Letters 358.2, pp. 209–217.
Paradis, E., J. Claude, and K. Strimmer (2004). “APE: analyses of phylogenetics and evolution in R language”. In: Bioinformatics 20. R package version 5.0, pp. 289–290.
Pereira, Fátima C. et al. (2013). “The Spore Differentiation Pathway in the Enteric Pathogen Clostridium difficile”. In: PLoS Genetics 9.10, p. 1003782.
Pinzón-Martínez, D. L. et al. (2010). “Thermophilic bacteria from Mexican thermal environments: Isolation and potential applications”. In: Environmental Technology 31.8-9, pp. 957–966.
Pogrebnyakov, Ivan, Christian Bille Jendresen, and Alex Toftgaard Nielsen (2017). “Genetic toolbox for controlled expression of functional proteins in Geobacillus spp.” In: PLoS ONE 12.2.
Poli, Annarita et al. (2006). “Anoxybacillus amylolyticus sp. nov., a thermophilic amylase producing bacterium isolated from Mount Rittmann (Antarctica)”. In: Systematic and Applied Microbiology 29.4, pp. 300–307.
Popa, Ovidiu and Tal Dagan (2011). “Trends and barriers to lateral gene transfer in prokaryotes”. In: Current Opinion in Microbiology 14.5, pp. 615–623.
Popa, Ovidiu et al. (2011). “Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes”. In: Genome Research 21.4, pp. 599–609.
Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin (2010). “FastTree 2 - Approximately maximum-likelihood trees for large alignments”. In: PLoS ONE 5.3. Ed. by Art F. Y. Poon, e9490.
Puigbò, Pere, Yuri I. Wolf, and Eugene V. Koonin (2010). “The tree and net components of prokaryote evolution.” In: Genome Biology and Evolution 2, pp. 745–756.

Puigbò, Pere et al. (2014). “Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes.” In: BMC Biology 12.1, p. 66.
Raghavan, R., Y. D. Kelkar, and H. Ochman (2012). “A selective force favoring increased G+C content in bacterial genes”. In: Proceedings of the National Academy of Sciences 109.36, pp. 14504–14507.
Ram, Karthik and Hadley Wickham (2018). wesanderson: A Wes Anderson Palette Generator. R package version 0.3.6.9000.
Rastogi, Gurdeep et al. (2009). “Isolation and characterization of cellulose-degrading bacteria from the deep subsurface of the Homestake gold mine, Lead, South Dakota, USA”. In: Journal of Industrial Microbiology and Biotechnology 36.4, pp. 585–598.
Ravenhall, Matt et al. (2015). “Inferring Horizontal Gene Transfer”. In: PLoS Computational Biology 11.5, pp. 1–16.
Reeve, Benjamin et al. (2016). “The Geobacillus Plasmid Set: A Modular Toolkit for Thermophile Engineering”. In: ACS Synthetic Biology 5.12, pp. 1342–1347.
Repar, Jelena and Tobias Warnecke (2017). “Non-random inversion landscapes in prokaryotic genomes are shaped by heterogeneous selection pressures”. In: Molecular Biology and Evolution 34.8, pp. 1902–1911.
Retchless, A. C. and J. G. Lawrence (2010). “Phylogenetic incongruence arising from fragmented speciation in enteric bacteria”. In: Proceedings of the National Academy of Sciences 107.25, pp. 11453–11458.
Revell, Liam J. (2012). “phytools: An R package for phylogenetic comparative biology (and other things).” In: Methods in Ecology and Evolution 3. R package version 0.6-44, pp. 217–223.
Rivera, M. C. et al. (1998). “Genomic evidence for two functionally distinct gene classes”. In: Proceedings of the National Academy of Sciences 95.11, pp. 6239–6244.

Rocha, Eduardo P C and Antoine Danchin (2003a). “Essentiality, not expressiveness, drives gene-strand bias in bacteria”. In: Nature Genetics 34.4, pp. 377–378.
Rocha, Eduardo P C, Marie Touchon, and Edward J. Feil (2006). “Similar compositional biases are caused by very different mutational effects”. In: Genome Research 16.12, pp. 1537–1547.
Rocha, Eduardo P.C. (2004). “The replication-related organization of bacterial genomes”. In: Microbiology 150.6, pp. 1609–1627.
Rocha, Eduardo P.C. (2008). “The Organization of the Bacterial Genome”. In: Annual Review of Genetics 42.1, pp. 211–233.
Rocha, Eduardo P.C. and Antoine Danchin (2003b). “Gene essentiality determines chromosome organisation in bacteria”. In: Nucleic Acids Research 31.22, pp. 6570–6577.
Ruiz-García, Cristina et al. (2005). “Bacillus velezensis sp. nov., a surfactant-producing bacterium isolated from the river Vélez in Málaga, southern Spain”. In: International Journal of Systematic and Evolutionary Microbiology 55.1, pp. 191–195.
Saujet, Laure et al. (2013). “Genome-Wide Analysis of Cell Type-Specific Gene Transcription during Spore Formation in Clostridium difficile”. In: PLoS Genetics 9.10, p. 1003756.
Seale, R. Brent et al. (2012). “Genotyping of present-day and historical geobacillus species isolates from milk powders by high-resolution melt analysis of multiple variable-number tandem-repeat loci”. In: Applied and Environmental Microbiology 78.19, pp. 7090–7097.
Sevenier, V. et al. (2012). “Prevalence of Clostridium botulinum and thermophilic heat-resistant spores in raw carrots and green beans used in French canning industry”. In: International Journal of Food Microbiology 155.3, pp. 263–268.

Sheng, Lili et al. (2017). “Development and implementation of rapid metabolic engineering tools for chemical and fuel production in Geobacillus thermoglucosidasius NCIMB 11955”. In: Biotechnology for Biofuels 10.1, p. 5.
Shimodaira, Hidetoshi (2002). “An approximately unbiased test of phylogenetic tree selection”. In: Systematic Biology 51.3, pp. 492–508.
Shimodaira, Hidetoshi and Masami Hasegawa (1999). “Multiple comparisons of log-likelihoods with applications to phylogenetic inference”. In: Molecular Biology and Evolution 16.8, pp. 1114–1116.
Shintani, Masaki et al. (2014). “Complete Genome Sequence of the Thermophilic Polychlorinated Biphenyl Degrader Geobacillus sp. Strain JF8 (NBRC 109937).” In: Genome Announcements 2.1, pp. 1213–1226.
Shulami, Smadar et al. (2011). “The L-arabinan utilization system of Geobacillus stearothermophilus”. In: Journal of Bacteriology 193.11, pp. 2838–2850.
Sierro, Nicolas et al. (2008). “DBTBS: A database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information”. In: Nucleic Acids Research 36.Sup.1, pp. 93–96.
Sievers, Fabian et al. (2011). “Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega”. In: Molecular Systems Biology 7.1, p. 539.
Skippington, Elizabeth and Mark A. Ragan (2012). “Phylogeny rather than ecology or lifestyle biases the construction of Escherichia coli-Shigella genetic exchange communities”. In: Open Biology 2.09, p. 120112.
Smillie, Chris S. et al. (2011). “Ecology drives a global network of gene exchange connecting the human microbiome”. In: Nature 480.7376, pp. 241–244.
Snel, Berend, Peer Bork, and Martijn A. Huynen (2002). “Genomes in flux: The evolution of Archaeal and Proteobacterial gene content”. In: Genome Research 12.1, pp. 17–25.

Soler-Bistué, Alfonso et al. (2015). “Genomic Location of the Major Ribosomal Protein Gene Locus Determines Vibrio cholerae Global Growth and Infectivity”. In: PLoS Genetics 11.4.
Sorek, Rotem et al. (2007). “Genome-wide experimental determination of barriers to horizontal gene transfer”. In: Science 318.5855, pp. 1449–1452.
Spencer, Paige S. et al. (2012). “Silent substitutions predictably alter translation elongation rates and protein folding efficiencies”. In: Journal of Molecular Biology 422.3, pp. 328–335.
Srivatsan, Anjana et al. (2010). “Co-orientation of replication and transcription preserves genome integrity”. In: PLoS Genetics 6.1, e1000810.
St-Pierre, François et al. (2013). “One-step cloning and chromosomal integration of DNA”. In: ACS Synthetic Biology 2.9, pp. 537–541.
Stamatakis, Alexandros (2014). “RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies”. In: Bioinformatics 30.9, pp. 1312–1313.
Studholme, David J. (2015). “Some (bacilli) like it hot: Genomics of Geobacillus species”. In: Microbial Biotechnology 8.1, pp. 40–48.
Syvanen, M (1994). “Horizontal Gene Transfer: Evidence and Possible Consequences”. In: Annual Review of Genetics 28.1, pp. 237–261.
Taboada, Blanca et al. (2018). “Operon-mapper: a web server for precise operon identification in bacterial and archaeal genomes”. In: Bioinformatics.
Tak, Euon Jung et al. (2018). “Virgibacillus phasianinus sp. nov., a halophilic bacterium isolated from faeces of a Swinhoe’s pheasant, Lophura swinhoii”. In: International Journal of Systematic and Evolutionary Microbiology 68.4, pp. 1190–1196.
Takaku, Hiroaki et al. (2006). “Microbial communities in the garbage composting with rice hull as an amendment revealed by culture-dependent and -independent approaches”. In: Journal of Bioscience and Bioengineering 101.1, pp. 42–50.

Takami, Hideto et al. (2004). “Thermoadaptation trait revealed by the genome sequence of thermophilic Geobacillus kaustophilus”. In: Nucleic Acids Research 32.21, pp. 6292–6303.
Tan, Ge et al. (2015). “Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference”. In: Systematic Biology 64.5, pp. 778–791.
Tatusov, Roman L. et al. (2003). “The COG database: An updated version includes eukaryotes”. In: BMC Bioinformatics 4, p. 41.
Thomas, Christopher M. and Kaare M. Nielsen (2005). Mechanisms of, and barriers to, horizontal gene transfer between bacteria.
Thomas, Sara H. et al. (2008). “The mosaic genome of Anaeromyxobacter dehalogenans strain 2CP-C suggests an aerobic common ancestor to the delta-proteobacteria”. In: PLoS ONE 3.5. Ed. by Niyaz Ahmed, e2103.
Tian, Ren Mao et al. (2015). “Rare events of intragenus and intraspecies horizontal transfer of the 16S rRNA gene”. In: Genome Biology and Evolution 7.8, pp. 2310–2320.
Tonini, João et al. (2015). “Concatenation and species tree methods exhibit statistically indistinguishable accuracy under a range of simulated conditions”. In: PLoS Currents 7.
Touchon, Marie, Louis Marie Bobay, and Eduardo P C Rocha (2014). “The chromosomal accommodation and domestication of mobile genetic elements”. In: Current Opinion in Microbiology 22, pp. 22–29.
Touchon, Marie et al. (2009). “Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths”. In: PLoS Genetics 5.1. Ed. by Josep Casadesús.
Tourova, T. P. et al. (2008). “alkB homologs in thermophilic bacteria of the genus Geobacillus”. In: Molecular Biology 42.2, pp. 217–226.

Treangen, Todd J. and Eduardo P C Rocha (2011). “Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes”. In: PLoS Genetics 7.1. Ed. by Nancy A. Moran, e1001284.
Tuller, Tamir et al. (2011). “Association between translation efficiency and horizontal gene transfer within microbial communities”. In: Nucleic Acids Research 39.11, pp. 4743–4755.
Udomsil, Natteewan et al. (2017). “Improvement of fish sauce quality by combined inoculation of Tetragenococcus halophilus MS33 and Virgibacillus sp. SK37”. In: Food Control 73, pp. 930–938.
Wang, Huai Chun, Edward Susko, and Andrew J. Roger (2006). “On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: Data quality and confounding factors”. In: Biochemical and Biophysical Research Communications 342.3, pp. 681–684.
Wang, Jing Li et al. (2016). “Lentibacillus amyloliquefaciens sp. nov., a halophilic bacterium isolated from saline sediment sample”. In: Antonie van Leeuwenhoek, International Journal of General and Molecular Microbiology 109.2, pp. 171–178.
Wang, Lei et al. (2006a). “Isolation and characterization of a novel thermophilic Bacillus strain degrading long-chain n-alkanes”. In: Extremophiles 10.4, pp. 347–356.
Wang, Stephanie T. et al. (2006b). “The forespore line of gene expression in Bacillus subtilis”. In: Journal of Molecular Biology 358.1, pp. 16–37.
Wei, Xiao Xing et al. (2010). “A mini-Mu transposon-based method for multiple DNA fragment integration into bacterial genomes”. In: Applied Microbiology and Biotechnology 87.4, pp. 1533–1541.
Welch, R. A. et al. (2002). “Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli”. In: Proceedings of the National Academy of Sciences 99.26, pp. 17020–17024.

Westers, Helga et al. (2003). “Genome Engineering Reveals Large Dispensable Regions in Bacillus subtilis”. In: Molecular Biology and Evolution 20.12, pp. 2076–2090.
Wickham, Hadley (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wickham, Hadley (2017). tidyverse: Easily Install and Load the ’Tidyverse’. R package version 1.2.1.
Wiegand, S. et al. (2013). “Complete Genome Sequence of Geobacillus sp. Strain GHH01, a Thermophilic Lipase-Secreting Bacterium”. In: Genome Announcements 1.2.
Williams, David, J. Peter Gogarten, and R. Thane Papke (2012). “Quantifying homologous replacement of loci between haloarchaeal species”. In: Genome Biology and Evolution 4.12, pp. 1223–1244.
Wolf, Yuri I. and Eugene V. Koonin (2013). “Genome reduction as the dominant mode of evolution”. In: BioEssays 35.9, pp. 829–837.
Wu, I. Lin et al. (2015). “A versatile nano display platform from bacterial spore coat proteins”. In: Nature Communications 6.
Yeung, Enoch et al. (2017). “Biophysical Constraints Arising from Compositional Context in Synthetic Gene Networks”. In: Cell Systems 5.1, 11–24.e12.
Zarei, Mina, Bianca Sclavi, and Marco Cosentino Lagomarsino (2013). “Gene silencing and large-scale domain structure of the E. coli genome”. In: Molecular BioSystems 9.4, p. 758.
Zeigler, Daniel R. (2014). “The Geobacillus paradox: Why is a thermophilic bacterial genus so prevalent on a mesophilic planet?” In: Microbiology (United Kingdom) 160.Part 1, pp. 1–11.
Zhang, Hongen, Paul Meltzer, and Sean Davis (2013). “RCircos: an R package for Circos 2D track plots”. In: BMC Bioinformatics 14, p. 244.

Zhang, Weiwei and Zhitang Lu (2015). “Phylogenomic evaluation of members above the species level within the phylum Firmicutes based on conserved proteins”. In: Environmental Microbiology Reports 7.2, pp. 273–281.
Zhu, Chunjie et al. (2014). “Lysinibacillus varians sp. nov., an endospore-forming bacterium with a filament-to-rod cell cycle”. In: International Journal of Systematic and Evolutionary Microbiology 64, pp. 3644–3649.

Appendix A

Supplementary Figures

Figure S1: Topology of HGT into individual GPA genomes. Density profiles showing the relative enrichment/depletion (green/red) of HT genes (predicted at T4) across individual GPA genomes. Overall gene density is given by the black line. Two genomes in the bottom right (boxed) show significant rearrangement of zones and are omitted from further topological analysis; see Section 4.3.2 for details.

Appendix B

Supplementary Tables

Table S1: Bacillaceae species growth temperatures

Species                              Optimal temp. (℃)  Source
Amphibacillus xylanus                37                 *
Terribacillus aidingensis            30-40              *
Virgibacillus necropolis             25-35              *
Virgibacillus sp. LM2416             30                 Tak et al. 2018
Lentibacillus amyloliquefaciens      35                 Wang et al. 2016
Virgibacillus sp. 6R                 ?                  –
Virgibacillus sp. SK37               30-40              Udomsil et al. 2017
Virgibacillus halodenitrificans      30                 *
Oceanobacillus iheyensis             30                 *
Salimicrobium jeotgali               28                 *
Halobacillus mangrovi                35                 *
Halobacillus halophilus              30                 *
Bacillus krulwichiae                 30                 *
Bacillus pseudofirmus                30                 *
Bacillus halodurans                  30                 *
Bacillus lehensis                    25                 *
Bacillus clausii                     30                 *
Bacillus cellulosilyticus            37                 *
Bacillus beveridgei                  30                 *
phosphorivorans                      29                 *
Fictibacillus arsenicus              30                 *
Bacillus cytotoxicus                 30-40              *
Bacillus mycoides                    25-30              *
Bacillus cereus                      30                 *
Bacillus thuringiensis               30                 *
Bacillus anthracis                   37                 Koehler 2009
Bacillus horikoshii                  30                 *
Bacillus cohnii                      30                 *
Bacillus endophyticus                28                 *
Bacillus flexus                      30                 *
Bacillus megaterium                  30                 *
Bacillus xiamenensis                 30                 *
Bacillus safensis                    30                 *
Bacillus pumilus                     30                 *
Bacillus altitudinis                 25-35              *
Bacillus amyloliquefaciens           30-40              *
Bacillus velezensis                  30                 Ruiz-García et al. 2005
Bacillus subtilis                    25-35              *
Bacillus paralicheniformis           37                 Dunlap et al. 2015
Bacillus licheniformis               30-40              *
Bacillus sonorensis                  30                 *
Bacillus glycinifermentans           37                 Kim et al. 2015
Bacillus weihaiensis                 ?                  –
Aeribacillus pallidus                50-60              *
Anoxybacillus gonensis               50-60              Belduz, Dulger, and Demirbag 2003
Anoxybacillus flavithermus           55-60              *
Geobacillus lituanicus               55                 *
Geobacillus stearothermophilus       50-55              *
Geobacillus thermocatenulatus        60                 *
Geobacillus sp. GHH01                60                 Wiegand et al. 2013
Geobacillus sp. C56T3                80                 Brumm et al. 2015b
Geobacillus sp. Y412MC52             65                 Brumm et al. 2015a
Geobacillus sp. Y412MC61             65                 Brumm et al. 2015a
Geobacillus stearothermophilus       60                 Alkhalili et al. 2016
Geobacillus thermoleovorans          60                 *
Geobacillus kaustophilus             55                 *
Geobacillus subterraneus             55                 *
Geobacillus thermodenitrificans      60                 *
Geobacillus thermoleovorans          60                 *
Geobacillus genomosp. 3              60                 Shintani et al. 2014
Geobacillus sp. WCH70                70                 Brumm, Land, and Mead 2016
Parageobacillus thermoglucosidasius  55-60              *
Geobacillus sp. Y4.1MC1              60                 Brumm et al. 2015c
Anoxybacillus amylolyticus           60                 *
Anoxybacillus sp. B2M1               60                 Filippidou et al. 2016
Anoxybacillus sp. B7M1               60                 Filippidou et al. 2016
Bacillus simplex                     30                 *
Bacillus muralis                     30                 *
Bacillus infantis                    30-40              *
Bacillus oceanisediminis             30-37              *
Bacillus kochii                      30                 *
Bacillus sp. 1NLA3E                  ?                  –
Bacillus methanolicus                45                 *
Bacillus sp. X12014                  ?                  –
Lysinibacillus fusiformis            30                 *
Lysinibacillus varians               30-40              Zhu et al. 2014
Bacillus sp. OxB1                    15-45              Asano and Kato 1998
Lysinibacillus sphaericus            30                 *
Bacillus smithii                     55                 *
Bacillus coagulans                   35-50              Konuray and Erginkaya 2018

* From https://bacdive.dsmz.de.
? No reference found.
Species ordered as in Figure 1.1; strain names removed for simplicity. For author references refer to main bibliography.

Appendix C

Other Publications

In 1840, 70 years prior to the discovery of sickled erythrocytes in humans, George Gulliver (FRS) observed that the blood corpuscles of several deer species were peculiarly misshapen. In the 1960s and 70s, the golden era of haemoglobin research, interest in deer erythrocytes was rekindled because deer, unlike humans, appeared to suffer no pathology related to sickling. While the molecular basis of human haemoglobin polymerisation, which drives cellular sickling, was already known to be the E6V mutation, the molecular mechanism of deer sickling remained unknown and progress in this field ceased. In 2015, my supervisor Toby, having discovered this open question in a bout of opportunistic reading of century-old journals, proposed we bring deer sickling into the era of genomics. Following standard scientific practices such as going to the zoo, rummaging in abandoned freezers for deer tongues, and buying deer steak from exotic meat vendors online, we acquired blood and tissue samples from 15 deer species. From these, we managed to derive nucleotide and amino acid sequences of the adult deer beta-globin. Leveraging our knowledge that two species, the moose and reindeer, did not possess sickling blood cells, we compared peptide sequences to identify a human-like E22V mutation unique to sickling deer. Molecular modelling by our collaborators Joe Marsh and Therese Bergendahl supported an interaction between this derived valine residue and the EF pocket, the same site bound by E6V in human sickling. Further, investigation of beta-globin gene trees and recombination patterns suggested that incomplete lineage sorting was likely responsible for the haphazard distribution of the sickling genotype in the deer phylogeny. Finally, we invoked long-term balancing selection as the likely force driving the retention of this polymorphism over at least 10 million years of evolution.
Overall, our work elucidated the molecular nature of sickling in deer and suggested that, as in humans, it might be driven by adaptive evolution. The ecological driver behind the sickling phenotype, however, remains unclear and further investigation, coupling epidemiological and population genetics studies, would be needed to fully address this.

Articles https://doi.org/10.1038/s41559-017-0420-3

The genetic basis and evolution of red blood cell sickling in deer

Alexander Esin1,2, L. Therese Bergendahl3, Vincent Savolainen4,5, Joseph A. Marsh3 and Tobias Warnecke1,2*

Crescent-shaped red blood cells, the hallmark of sickle-cell disease, present a striking departure from the biconcave disc shape normally found in mammals. Characterized by increased mechanical fragility, sickled cells promote haemolytic anaemia and vaso-occlusions and contribute directly to disease in humans. Remarkably, a similar sickle-shaped morphology has been observed in erythrocytes from several deer species, without obvious pathological consequences. The genetic basis of erythrocyte sickling in deer, however, remains unknown. Here, we determine the sequences of human β-globin orthologues in 15 deer species and use protein structural modelling to identify a sickling mechanism distinct from the human disease, coordinated by a derived valine (E22V) that is unique to sickling deer. Evidence for long-term maintenance of a trans-species sickling/non-sickling polymorphism suggests that sickling in deer is adaptive. Our results have implications for understanding the ecological regimes and molecular architectures that have promoted convergent evolution of sickling erythrocytes across vertebrates.

Human sickling is caused by a single amino acid change (E6V) in the adult β-globin (HBB) protein1. Upon deoxygenation, steric changes in the haemoglobin tetramer enable an interaction between 6V and a hydrophobic acceptor pocket (known as the EF pocket) on the β-surface of a second tetramer2,3. This interaction promotes polymerization of mutant haemoglobin (HbS) molecules, which ultimately coerces red blood cells into the characteristic sickle shape. Heterozygote carriers of the HbS allele are typically asymptomatic4, whereas HbS homozygosity has severe pathological consequences and is linked to shortened lifespan5. Despite this, the HbS allele has been maintained in sub-Saharan Africa by balancing selection because it confers, by incompletely understood means, a degree of protection against the effects of Plasmodium infection and malaria6.

Sickling red blood cells were first described in 1840 (70 years before their discovery in humans7) when unusual erythrocyte shapes were reported in blood from white-tailed deer (Odocoileus virginianus)8. Subsequent research revealed that sickling is widespread among deer species worldwide8–11 (Fig. 1 and Supplementary Table 1). It is not, however, universal: red blood cells from reindeer (Rangifer tarandus) and European elk (Alces alces; known as moose in North America) do not sickle and neither do erythrocytes from most North American wapiti (Cervus canadensis)11,12.

Sickling deer erythrocytes are similar to human HbS cells with regard to their gross morphology and the tubular ultrastructure of haemoglobin polymers13–16. Moreover, as in humans, sickling is reversible through modulation of oxygen supply or pH9,17 and mediated by specific β-globin alleles18,19, with both sickling and non-sickling alleles segregating in wild populations of white-tailed deer20. As in humans, α-globin (two copies of which join two β-globin proteins to form the haemoglobin tetramer) is not directly implicated in sickling aetiology18,21. Also, as in humans, foetal haemoglobin molecules, which incorporate distinct β-globin paralogues in humans and deer (see below), do not promote sickling under the same conditions19. However, whereas HbS sickling occurs when oxygen tension is low, deer erythrocytes sickle under high oxygen partial pressure and at alkaline pH17. Consequently, prime conditions for sickling in deer are likely to be found in lung capillaries (rather than in systemic capillaries where oxygen is unloaded), although in vivo sickling can also be observed in peripheral venous blood22, especially following exercise regimes that induce transient respiratory alkalosis23. Furthermore, unlike in humans, sickled deer erythrocytes do not exhibit increased mechanical fragility in vitro17,18 and the sickling allele in white-tailed deer (previously labelled βIII) is the major allele, with >60% of individuals homozygous for βIII (refs 20–24). Remarkably, βIII homozygotes do not display aberrant haematological values or obvious pathological traits25. Together, these observations are consistent with reduced physiological costs of sickling in deer. However, it is unknown whether sickling is simply innocuous, as previously suggested23, or plays an HbS-like adaptive role. In addition, partial peptide digests of sickling white-tailed deer β-globins did not recover the E6V mutation that causes sickling in humans24, leaving the genetic basis of sickling in deer unresolved.

Results

The molecular basis of sickling in deer is distinct from that in the human disease. To dissect the molecular basis of sickling in deer and elucidate its evolutionary history and potential adaptive significance, we used a combination of whole-genome sequencing, locus-specific assembly and targeted amplification to determine the sequence of the HBBA gene, which encodes the adult β-globin chain, in a phylogenetically broad sample of 15 deer species, including both sickling and non-sickling taxa (Fig. 1 and Supplementary Table 1). Globin genes in mammals are located in paralogue clusters, which, despite a broadly conserved architecture, constitute hotbeds of pseudogenization, gene duplication, conversion and

1Molecular Systems Group, Medical Research Council London Institute of Medical Sciences, Du Cane Road, London, United Kingdom. 2Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Du Cane Road, London, United Kingdom. 3Medical Research Council Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, United Kingdom. 4Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot, United Kingdom. 5University of Johannesburg, Auckland Park, Johannesburg, South Africa. *e-mail: [email protected]

Nature Ecology & Evolution | VOL 2 | FEBRUARY 2018 | 367–376 | www.nature.com/natecolevol © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

[Figure 1 image: alignment of adult β-globin peptide sequences, residues 1–140, with a consensus row and α-helix positions marked above. Rows, top to bottom: Homo sapiens, Macaca mulatta, Papio anubis, Callithrix jacchus, Mus musculus, Rattus norvegicus, Oryctolagus cuniculus, Equus caballus, Ceratotherium simum, Camelus ferus, Vicugna pacos, Giraffa camelopardalis, Bos taurus, Capra hircus, Ovis aries, Moschus berezovskii, Alces alces, Odocoileus virginianus, Pudu puda, Rangifer tarandus, Capreolus capreolus, Hydropotes inermis, Rucervus duvaucelii, Cervus albirostris, Cervus nippon, Cervus canadensis, Cervus elaphus elaphus, Cervus elaphus bactrianus, Elaphurus davidianus, Dama dama, Muntiacus reevesi, Sus scrofa, Pteropus vampyrus, Myotis lucifugus, Eptesicus fuscus, Odobenus rosmarus, Ailuropoda melanoleuca, Felis catus, Panthera tigris.]

Fig. 1 | Mammalian adult β-globin peptide sequences in phylogenetic context. To facilitate comparisons with previous classic literature, residues here and in the main text are numbered according to human HBB, skipping the leading methionine. Dots represent residues identical to the consensus sequences,

defined by the most common amino acid (X indicates a tie). The key residues discussed in the text are highlighted. HBBA sequences from deer are coloured according to their documented sickling state (red, sickling; blue, non-sickling; grey, indeterminate; Supplementary Table 1). Green cylinders highlight the position of α​-helices in the secondary structure of human HBB. See Methods and Supplementary Fig. 11 for derivation of the accompanying cladogram.
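The consensus rule stated in the Fig. 1 caption (most common amino acid per column, X for a tie, dots for residues matching the consensus) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy five-residue sequences are hypothetical, and the sequences are assumed to be pre-aligned and of equal length.

```python
from collections import Counter


def consensus(seqs):
    """Return the column-wise consensus; 'X' marks a tie for most common residue."""
    cons = []
    for column in zip(*seqs):  # assumes equal-length, pre-aligned sequences
        counts = Counter(column).most_common()
        top = counts[0][1]
        winners = [aa for aa, n in counts if n == top]
        cons.append(winners[0] if len(winners) == 1 else "X")
    return "".join(cons)


def dot_format(seq, cons):
    """Replace residues identical to the consensus with dots."""
    return "".join("." if a == c else a for a, c in zip(seq, cons))


# Toy sequences (hypothetical, for illustration only)
toy = ["VHLTE", "VHLTV", "VHLSV", "MHLSE"]
cons = consensus(toy)
print(cons)  # VHLXX
for s in toy:
    print(dot_format(s, cons))
```

A residue in a tied column is never shown as a dot, because no sequence residue equals the X placeholder.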

loss26,27. In ruminants, the entire β-globin cluster is triplicated in goats (Capra hircus)28 and duplicated in cattle (Bos taurus)29, where two copies of the ancestral β-globin gene sub-functionalized to become specifically expressed in adult (HBBA) and foetal (HBBF) blood. Based on a recent draft assembly of a white-tailed deer (O. virginianus texanus) genome, the architecture of the β-globin cluster mirrors that seen in cattle, which is consistent with the duplication event pre-dating the Bovidae–Cervidae split (Supplementary Fig. 1). Primers designed before this assembly became available frequently co-amplified HBBA and HBBF (see Methods and Supplementary Fig. 2). In the first instance, we therefore assigned foetal and adult status based on residues specifically


[Figure 2 image not reproduced. Panel a: haemoglobin structure with residues 6, 19, 22, 56, 87 and 120 annotated by sickling state. Panel b: electrostatic surfaces (scale –4 to +4 kBT/e) for R. tarandus (non-sickling) and O. virginianus (sickling), with the proposed sickling interaction marked. Panel c: fibre of three tetramers showing the 22V–87Q contact and the axial contact. Panel d: fibre formation propensity (y axis) against residue (x axis) for oxyhaemoglobin and deoxyhaemoglobin. Panel e: fibre interaction energy (kcal mol–1) per deer species, grouped as sickling, indeterminate or non-sickling.]

Fig. 2 | Structural basis for sickling of deer haemoglobin. a, Structure of oxyhaemoglobin (Protein Data Bank ID: 1HHO) with the key residues associated with sickling highlighted in one of the β-globin chains. b, Comparison of the electrostatic surfaces of oxy HBBA from a non-sickling (R. tarandus) and sickling (O. virginianus) species. The electrostatic surface is displayed in units of kBT/e, where kB is the Boltzmann constant, T is the temperature and e is the elementary charge. c, Example of a haemoglobin fibre formed via directed docking between residues 22V and 87Q of O. virginianus oxy HBBA. d, Fibre formation propensity derived from docking simulations centred at a given focal residue in O. virginianus oxy and deoxy HBBA. These values represent the fraction of docking models that result in HbS-like haemoglobin fibre structures. e, Fibre interaction energy for different deer species, determined by mutating the 270 22V–87Q docking models compatible with fibre formation and calculating the energy of the interaction. Error bars represent s.e.m.
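The two summary quantities named in the Fig. 2 caption can be illustrated with a short sketch: fibre formation propensity as the fraction of docking models yielding HbS-like fibres (panel d), and the mean with standard error of the mean for a set of interaction energies (panel e). The function names and the input values are hypothetical; the classification of docking models as fibre-compatible is taken as given and is not reproduced here.

```python
import math


def fibre_propensity(fibre_like_flags):
    """Fraction of docking models flagged as producing HbS-like fibres."""
    return sum(fibre_like_flags) / len(fibre_like_flags)


def mean_sem(values):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical inputs, for illustration only
flags = [True, False, True, True]      # one flag per docking model
energies = [-2.0, -3.0, -2.5, -2.5]    # kcal/mol, one value per model

print(fibre_propensity(flags))
print(mean_sem(energies))
```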

shared with either HBBA or HBBF in cattle, which results in independent clustering of putative HBBA and HBBF genes on a HBBA/F gene tree (Supplementary Fig. 3). To confirm these assignments, we sequenced messenger RNA from the red cell component of blood from an adult Père David’s deer (Elaphurus davidianus) and assembled the erythrocyte transcriptome de novo (see Methods). We identified a highly abundant β-globin transcript (>200,000 transcripts per million) corresponding precisely to the putative adult β-globin gene amplified from genomic DNA of the same individual (Supplementary Fig. 4). Reads that uniquely matched the putative HBBA gene were >2,000-fold more abundant than reads uniquely matching the putative HBBF gene, which is expressed at low levels. This is similar to the situation in humans, where transcripts of HBG, a distinct paralogue that convergently evolved foetal expression, are found at low abundance in adult blood30. Finally, our assignments are consistent with partial peptide sequences for white-tailed deer24, fallow deer (Dama dama)31 and reindeer32 that were previously obtained from the blood of adult individuals.

We then considered deer HBBA orthologues in a wider mammalian context, restricting the analysis to species with high-confidence HBB assignments (see Methods). Treating wapiti as non-sickling and four species as indeterminate (no or insufficient phenotyping of sickling; see Supplementary Table 1), we find three residues (Fig. 1) that discriminate sickling from non-sickling species: 22 (non-sickling: E; sickling: V/I), 56 (non-sickling: H; sickling: G) and 87 (non-sickling: K; sickling: Q/H). The change at residue 22, from an ancestral glutamic acid to a derived valine (isoleucine in Pudu puda), is reminiscent of the human HbS mutation and occurs at a site that is otherwise highly conserved throughout mammalian evolution.

Structural modelling supports an interaction between 22V and the EF pocket. To understand how sickling-associated amino acids promote polymerization, we examined these residues in their protein structural context. Residue 22 lies on the surface of the haemoglobin tetramer at the start of the second alpha helix (Fig. 2a). Close to residue 22 are residue 56 and two other residues that differ between non-sickling reindeer and moose (but not wapiti) and established sickling species: 19 (non-sickling: K; sickling: N) and 120 (non-sickling: K; sickling: G/S). Together, these residues form part of a surface of increased hydrophobicity in sickling species (Fig. 2b). Distal to this surface, residue 87 is situated at the perimeter of the EF pocket, which in humans interacts with 6V to laterally link two β-globin molecules in different haemoglobin tetramers and stabilize the parallel strand architecture of the HbS fibre2,3,33,34. Mutation of residue 87 in humans can have marked effects on sickling dynamics35. For example, erythrocytes derived from HbS/Hb Quebec–Chori (T87I) compound heterozygotes sickle like HbS homozygotes36, while Hb D-Ibadan (T87K) inhibits sickling37. Given the similarity between the human E6V mutation and E22V in sickling deer, we hypothesized that sickling occurs through


an interaction in trans between residue 22 and the EF pocket. To test whether such an interaction is compatible with fibre formation, we carried out directed docking simulations centred on these two residues using a homology model of oxy β-globin from white-tailed deer (see Methods). We then used the homodimeric interactions from docking to build polymeric haemoglobin structures, analogous to how the 6V–EF interaction leads to extended fibres in HbS homozygotes. Strikingly, nearly half of our docking models resulted in HbS-like straight, parallel strand fibres (Fig. 2c). In contrast, when we performed similar docking simulations centred on residues other than 22V, nearly all were incompatible or much less compatible with fibre formation (Fig. 2d). Out of all 145 β-globin residues, only 19N, which forms a contiguous surface with 22V, has a higher propensity to form HbS-like fibres. In contrast, when docking is carried out using the deoxy β-globin structure, 22V is incompatible with fibre formation, consistent with the observation that sickling in deer occurs under oxygenated conditions. Importantly, when this methodology is applied to human HbS, we find that 6V has the highest fibre formation propensity out of all the residues under deoxy conditions (Supplementary Fig. 5), providing validation for the approach.

Next, we used a force field model to compare the energetics of fibre formation across deer species. We found that known non-sickling species (Fig. 1) and two species suspected of being non-sickling based on their β-globin primary sequence, Chinese water deer (Hydropotes inermis) and roe deer (Capreolus capreolus), exhibit energy terms less favourable to fibre formation than sickling species (Fig. 2e). To elucidate the relative contribution of 22V and other residues to fibre formation, we introduced all single amino acid differences found among adult deer β-globins individually into a sickling (O. virginianus) and non-sickling (R. tarandus) background in silico and considered the change in fibre interaction energy. Changes at residue 22 have the strongest predicted effect on fibre formation, along with two residues (19 and 21) in its immediate vicinity (Supplementary Fig. 6). Smaller effects of amino acid substitutions at residue 87, as well as residues 117 (N in P. puda and O. virginianus) and 118 (Y in D. dama), hint at species-specific modulation of sickling propensity. In silico residue swaps at a shorter evolutionary time-scale, between non-sickling C. canadensis and sickling sika deer (Cervus nippon), similarly implicate 22V as a key determinant of sickling (Supplementary Fig. 6).

Taken together, the results support the formation of HbS-like fibres in sickling deer erythrocytes via surface interactions centred on residues 22V and 87Q in β-globin molecules of different haemoglobin tetramers. In contrast, previous attempts to model interactions in the deer haemoglobin fibre, based on preliminary crystallographic data for white-tailed deer haemoglobin24,38,39, either incorrectly assumed a hexagonal fibre architecture or proposed different relative orientations and contacts that fail to predict differences between sickling and non-sickling chains.

Evidence for incomplete lineage sorting during the evolution of HBBA. To shed light on the evolutionary history of sickling and elucidate its potential adaptive significance, we considered sickling and non-sickling genotypes in phylogenetic context. First, we noted that the HBBA gene tree and species tree (derived from 20 mitochondrial and nuclear genes) are significantly discordant (Approximately Unbiased test, P < 1 × 10^−61; see Methods). Notably, sickling and non-sickling genotypes are polyphyletic on the species tree but monophyletic on the HBBA tree, where wapiti (an Old World deer) clusters with moose and reindeer (two New World deer) (Fig. 3a). Gene tree-species tree discordance can result from a number of evolutionary processes, including incomplete lineage sorting, gene conversion, introgression and classic convergent evolution, where point mutations arise and fix independently in different lineages. In our case, the convergent evolution scenario fits the data poorly. Discordant amino acid states are found throughout the HBBA sequence and are not limited to sickling-related residues. Furthermore, in many instances, amino acids shared between phylogenetically distant species are encoded by the same underlying codons. Conspicuously, this includes the case of residue 120, where all three codon positions differ between sickling species (GGT/AGT) and non-sickling relatives (AAG in reindeer, moose and cattle; Supplementary Fig. 7 and Supplementary Data 1). Even if convergence were driven by selection on a narrow adaptive path through genotype space, precise coincidence of mutational paths at multiple non-synonymous and synonymous sites must be considered unlikely. Rather, these patterns are prima facie consistent with incomplete lineage sorting, a process that might have prominently accompanied the rapid divergence of Old World from New World deer during the Miocene40.

Gene conversion affects HBBA evolution, but does not explain the phyletic pattern of sickling. To shore up this conclusion and rule out alternative evolutionary scenarios, we next asked whether identical genotypes, rather than originating from a common ancestor, might have been independently reconstituted from genetic diversity present in other species (via introgression) or in other parts of the genome (via gene conversion). To evaluate the likelihood of introgression and particularly gene conversion, which has been attributed a major role in the evolution of mammalian globin genes26, we first searched for evidence of recombination in an alignment of deer HBBA and HBBF genes. HBBF is the principal candidate to donate non-sickling residues to HBBA in a conversion event given that it is itself refractory to sickling19 and, as a recent duplicate of the ancestral HBBA gene, retains high levels of sequence similarity. Using a combination of phylogeny-based and probabilistic detection methods and applying permissive criteria that allow inference of shorter recombinant tracts (see Methods), we identified eight candidate HBBF-to-HBBA events, two of which, in Chinese water deer and wapiti, are strongly supported by different methods (Fig. 3b). Importantly, however, we found no evidence for gene conversion involving residue 22 (Fig. 3b and Supplementary Fig. 8) even when considering poorly supported candidate events. Recombination between HBBF and/or HBBA genes therefore does not explain the distribution of glutamic acids and valines at residue 22 across Old World and New World deer. Consistent with this, the removal of putative recombinant regions does not affect the HBBF/HBBA gene tree, with wapiti robustly clustered with other non-sickling species whereas white-tailed deer and pudu cluster with Old World sickling species (Supplementary Fig. 8). We further screened raw genome sequencing data from white-tailed deer and wapiti for potential donor sequences beyond HBBF, such as HBE or pseudogenized HBD sequences, but did not find additional candidate donors. Thus, although gene conversion is a frequent phenomenon in the history of mammalian globins26 and contributes to HBB evolution in deer, it does not by itself explain the phylogenetic distribution of key sickling/non-sickling residues. Rather, gene conversion introduces additional complexity on a background of incomplete lineage sorting.

Balancing selection has maintained ancestral variation in HBBA. The presence of incomplete lineage sorting and gene conversion confounds straightforward application of rate-based (dN/dS-type) tests for selection, making it harder to establish whether the sickling genotype is simply tolerated or has been under selection. We therefore examined earlier protein-level data on HBBA allelic diversity. This allowed us to include additional alleles previously identified from partial peptide digests, for which we have no nucleotide-level data. For white-tailed deer, this includes βII, which is associated with a different flavour of polymerization that results in matchstick-shaped erythrocytes41, and two rarer non-sickling alleles, βV and βVII. βII encodes 22V and expectedly clusters with other sickling


[Figure 3 image not reproduced. Panel a: species tree alongside the HBBA gene tree for the sampled deer species, with bootstrap support at salient nodes. Panel b: nucleotide identity across exons 1–3, with residues 19, 22, 56, 87 and 120 marked, and numbered predicted recombination events for H. inermis, C. canadensis, O. virginianus, P. puda, R. tarandus, D. dama and R. duvaucelii. Panel c: maximum likelihood protein tree of adult (HBBA) and foetal (HBBF) β-globins with amino acid identity at residues 19, 22, 56, 87 and 120; scale bar 0.04.]

Fig. 3 | Evidence for incomplete lineage sorting, gene conversion and a trans-species polymorphism in the evolutionary history of deer HBBA. a, Discordances between the maximum likelihood HBBA gene tree and the species tree. Topological differences that violate the principal division into New World deer (Capreolinae) and Old World deer (Cervinae) are highlighted by solid black lines. Bootstrap values (% out of 1,000 bootstrap replicates) are highlighted for salient nodes. b, Gene conversion and/or introgression. The top panel illustrates nucleotide identity between HBBA and HBBF orthologues (green: 100%; yellow: 30–100%; red: <30% identity). The low-identity segment towards the end of intron 2 marks repeat elements present in all adult sequences, but absent from all foetal sequences. Below: predicted recombination events affecting HBBA genes (orange), with either an adult orthologue (orange) or a foetal HBBF paralogue (green) as the predicted source, suggestive of introgression or gene conversion, respectively. The number of asterisks indicates how many detection methods (out of a maximum of seven) predicted a given event (see Methods). Details for individual events (numbered in parentheses) are given in Supplementary Fig. 8. c, Maximum likelihood protein tree of adult (orange) and foetal (green) β-globin. Alternate non-sickling D. dama (βII) and O. virginianus (βV, βVII) alleles group with non-sickling species. Amino acid identity at key sites is shown on the right. A question mark indicates that the amino acid is unresolved in the primary source. Species labels in panels a and c are coloured according to their documented sickling state (red, sickling; blue, non-sickling; grey, indeterminate; Supplementary Table 1).
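The identity colour scheme described for panel b (green: 100%; yellow: 30–100%; red: <30%) can be illustrated with a small sketch that bins windowed nucleotide identity between two aligned sequences. This is only an illustration under stated assumptions: the window size, the function names and the toy sequences are hypothetical, the source does not say how identity was windowed, and gap characters are simply treated as mismatches here.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length strings match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def identity_colour(pct):
    """Bin an identity percentage using the thresholds in the panel b legend."""
    if pct >= 100.0:
        return "green"
    if pct >= 30.0:
        return "yellow"
    return "red"


def window_colours(a, b, window):
    """Colour each non-overlapping window of the alignment (window size is an assumption)."""
    return [
        identity_colour(percent_identity(a[i:i + window], b[i:i + window]))
        for i in range(0, len(a), window)
    ]


# Toy aligned sequences (hypothetical)
print(window_colours("ACGTACGTACGT", "ACGTACGAGGCA", 4))
```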


HBBA sequences (Fig. 3c). More importantly, the non-sickling white-tailed deer alleles cluster with non-sickling HBBA orthologues rather than the conspecific βII and βIII alleles (Fig. 3c), as does the HBBA sequence from O. virginianus texanus, for which we can also demonstrate clustering at the nucleotide level (Supplementary Fig. 3). Similarly, an alternate adult β-globin chain previously observed in fallow deer31, a predominantly sickling Old World deer, clusters with non-sickling sequences (Fig. 3c). Finally, phenotypic heterogeneity in wapiti12 and sika deer42 sickling indicates that rare sickling and non-sickling variants, respectively, also segregate in these two species. Taken together, these findings point to the long-term maintenance of ancestral variation through successive speciation events dating back to the most recent common ancestor of Old World and New World deer, an estimated 13.6 million years ago (Ma) (confidence interval: 9.84–17.33 Ma)43.

Might this polymorphism have been maintained simply by chance or must balancing selection be invoked to account for its survival? We currently lack information on broader patterns of genetic diversity at deer HBBA loci and surrounding regions that would allow us to search for footprints of balancing selection explicitly. However, we can estimate the probability P that a trans-species polymorphism has been maintained along two independent lineages by neutral processes alone as

$$P = \left(e^{-T/2N_e}\right) \times \left(e^{-T/2N_e}\right)$$

where T is the number of generations since the two lineages split and Ne is the effective population size44,45. For simplicity, we assume Ne to be constant over T and the same for both lineages. In the absence of reliable species-wide estimates for Ne, we can nonetheless ask what Ne would be required to meet a given threshold probability. Conservatively assuming an average generation time of one year46,47 and a split time of 7.2 Ma (the lowest divergence time estimate in the literature43), Ne would have to be 2,403,419 to reach a threshold probability of 0.05 (the expression simplifies to P = e^(−T/Ne), so Ne = T / ln(1/P) = 7.2 × 10^6 / ln 20). Although deer can have large census population sizes, an Ne > 2,000,000 for both fallow and white-tailed deer is comfortably outside what we would expect for large-bodied mammals (greater than fourfold higher than estimates for wild mice48 and greater than twofold higher even than estimates for African populations of Drosophila melanogaster49). Consequently, we argue that the HBBA trans-species polymorphism is inconsistent with neutral evolution and instead reflects the action of balancing selection.

A distinct genetic basis for sickling in sheep. While sickling in deer is particularly well-documented, the capacity for reversible haemoglobin polymerization has also been observed in a small coterie of other vertebrates10,50, including some species of fish50, mongoose51 and notably also goats and sheep (Ovis aries)11,52. For most of these species, we have no information on sickling-associated genotypes and allelic diversity. Sheep, where sickling has been found in a variety of domestic breeds52,53, are an exception in this regard. Two HBBA alleles, HbA and HbB, were previously identified54. HbA homozygotes and HbA/HbB heterozygotes sickle, whereas HbB homozygotes do not52. We first compared the sheep reference sequence (Texel breed) included in Fig. 1 with partial peptide information for both alleles54 and found it to be fully consistent with the non-sickling HbB allele. We then surveyed amino acid variation at the β-globin gene across 75 breeds of sheep, selected to cover global sheep genetic diversity55. We observed all seven amino acids known to discriminate HbA from HbB, but found no variation at residues 6 or 22 (Supplementary Fig. 9), suggesting, first, that the genetic diversity panel captures HbA and, second, that HbA, lacking 6V and 22V, promotes polymerization by yet another mechanism. This conclusion is consistent with phenomenological differences in sickling dynamics between deer and sheep, including (1) the finding that sickling in sheep only occurs when cells are suspended in hypertonic saline and incubated at 37 °C (ref. 11), making it less likely that sickling frequently takes place in vivo under physiological conditions and (2) the observation that the sickling allele is dominant in sheep but recessive in deer19. Importantly, these results also indicate that sickling evolved independently in deer (Cervidae) and their sister clade (Caprinae).

Discussion

Given the dramatic change in erythrocyte shape brought about by haemoglobin polymerization, it is conspicuous that multiple vertebrate lineages have independently converged on this phenotype. In principle, recurrent emergence could be the result of non-adaptive forces. Recent findings suggest that symmetric protein complexes like haemoglobin exist at the edge of supramolecular self-assembly, often being a short mutational distance away from the propensity to form polymers56. However, in deer, the long-term maintenance of a trans-species polymorphism is inconsistent with selective neutrality and instead argues for fitness effects along multiple lineages. By direct implication, even though sickling is remarkably well tolerated in vivo, perhaps owing to unique properties of deer erythrocytes (Supplementary Discussion), it cannot be perfectly innocuous23. Rather, it must exert a physiological effect that is strong and frequent enough to be targeted by selection.

What are the ecological driving factors behind the maintenance of sickling (and non-sickling) alleles over evolutionary time? It has previously been suggested that haemoglobin polymerization might, by radically altering the intracellular environment of red blood cells, provide a generic defence mechanism against red blood cell parasites50. Deer certainly harbour a number of intra-erythrocytic parasites, including Babesia57 and Plasmodium58,59. Plasmodium was recently found to be widespread in white-tailed deer, but, interestingly, associated with very low levels of parasitaemia59. Also worth noting in this regard is the marked geographic asymmetry in sickling status, where established non-sickling species are restricted to arctic and subarctic (elk and reindeer) or mountainous (wapiti) habitats. Might this indicate that the sickling allele loses its adaptive value in colder climates, perhaps linked to the lower prevalence of blood-borne parasites? Although a general cross-species link between sickling and parasite burden is tantalizing, it is important to highlight that there is currently no concrete evidence for such a connection and alternative hypotheses should be considered. For example, with no evidence for heterozygote advantage, might it be that allelic diversity has been maintained by migration–selection balance? Or, do the timescales involved render such a scenario improbable? Exploring geographic structure in the distribution of sickling and non-sickling alleles will be important in this regard and might point to the ecological factors involved in maintaining either allele. More generally, future epidemiological studies coupled with population genetic investigations will be required to unravel the evolutionary ecology of sickling in deer and establish whether parasites are indeed ecological drivers of between- and within-species differences in the HBBA genotype. Ultimately, such analyses will determine whether deer constitute a useful comparative system to elucidate the link between sickling and protection from the effects of Plasmodium infection, which remains poorly understood in humans.

Methods

Sample collection and processing. Blood, muscle tissue and DNA samples were acquired for 15 species of deer from a range of sources (Supplementary Table 1). The white-tailed deer blood sample was heat-treated on import to the United Kingdom in accordance with import standards for ungulate samples from non-European Union countries (IMP/GEN/2010/07). Fresh blood was collected into PAXgene Blood DNA tubes (PreAnalytiX) and DNA extracted using the PAXgene Blood DNA Kit (PreAnalytiX). DNA from previously frozen blood samples was extracted using the QIAamp DNA Blood Mini Kit (Qiagen). DNA from tissue samples was extracted with the QIAamp DNA Mini Kit using 25 mg


of tissue. Total RNA was isolated from an E. davidianus blood sample using the PAXgene Blood RNA Kit three days after collection into a PAXgene Blood RNA Tube. All extractions were performed according to the manufacturers’ protocols. For each sample, we validated species identity by amplifying and sequencing the cytochrome b (CytB) gene. With the exception of Cervus albirostris, we successfully amplified CytB from all samples using the primers MTCB_F/R (Supplementary Fig. 2) and conditions as described in ref. 60. Phusion High-Fidelity PCR Master Mix (Thermo Fisher Scientific) was used for all amplifications. Polymerase chain reaction products were purified using the MinElute PCR Purification Kit (Qiagen) and Sanger-sequenced with the amplification primers. The CytB sequences obtained were compared with all available deer CytB sequences in the 10kTrees project61 using the ape package (function dist.dna with default arguments) in R (ref. 62). In all cases, the presumed species identity of the sample was confirmed (Supplementary Table 2).

Whole-genome sequencing. O. virginianus genomic DNA was prepared for sequencing using the NEBNext DNA Library Prep Kit (New England Biolabs) and sequenced on the Illumina HiSeq platform. The resulting 229 million 100-base pair (bp) paired-end reads were filtered for adapters and quality using Trimmomatic63 with the following parameters: ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:4:30 MINLEN:50. Inspection of the remaining 163.5 million read pairs with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) suggested that over-represented sequences had been successfully removed.

Mapping and partial assembly of the O. virginianus β-globin locus. To seed a local assembly of the O. virginianus β-globin locus, we first mapped O. virginianus trimmed paired-end reads to the duplicated β-globin locus in the hard-masked B. taurus genome (UMD3.1.1; chromosome 15: 48973631–49098735). The β-globin locus is defined here as the region including all B. taurus β-globin genes (HBE1, HBE4, HBBA (ENSBTAG00000038748), HBE2, HBBF (ENSBTAG00000037644)), the intervening sequences and 24 kilobases either side of the two outer β-globins (HBE1 and HBBF). The mapping was performed using Bowtie 2 (ref. 64) with default settings and the optional --no-mixed and --no-discordant parameters. 110 reads mapped without gaps and a maximum of one nucleotide mismatch. These reads, broadly dispersed across the B. taurus β-globin locus (Supplementary Fig. 10), were used as seeds for local assembly using a customized aTRAM65 pipeline (see below). Before assembly, the remainder of the reads were filtered for repeat sequences by mapping against Cetartiodactyla repeats in Repbase66. The aTRAM.pl wrapper script was modified to accept two new arguments: max_target_seqs <int> limited the number of reads found by BLAST from each database shard, while cov_cutoff <int> passed a minimum coverage cut-off to the underlying Velvet 1.2.10 assembler67. The former modification prevents stalling when the assembly encounters a repeat region, while the latter discards low coverage contigs at the assembler level. aTRAM was run with the following arguments: -kmer 31 -max_target_seqs 2000 -ins_length 270 -exp_coverage 8 -cov_cutoff 2 -iterations 5. After local assembly on each of the 110 seed reads, the resulting contigs were combined using Minimo68 with a required minimum nucleotide identity of 99%. To focus specifically on assembling the adult β-globin gene, only contigs that mapped against the B. taurus adult β-globin gene ± 500 bp (chromosome 15: 49022500–49025000) were retained and served as seeds for another round of assembly. This procedure was repeated twice. The final 59 contigs were compared with the UMD3.1.1 genome using BLAT and mapped exclusively to either the adult or foetal B. taurus β-globin gene. From the BLAT alignment, we identified short sequences that were perfectly conserved between the assembled deer contigs and the B. taurus as well as the sheep assembly (Oar_v3.1). Initial forward and reverse primers (Ovirg_F1/Ovirg_R1; Supplementary Fig. 2) for β-globin amplification were designed from these conserved regions located 270 bp upstream (chromosome 15: 49022762–49022786) and 170 bp downstream (chromosome 15: 49024637–49024661) of the B. taurus adult β-globin gene, respectively. Our local assembly is consistent with a recent draft genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCF_002102435.1/) from a white-tailed deer from Texas (O. virginianus texanus).

Globin gene amplification and sequencing. Amplification of β-globin from O. virginianus using the primers Ovirg_F1 and Ovirg_R1 yielded two products of different molecular weights (~2,000 bp and ~1,700 bp; Supplementary Fig.

adult β-globin (Supplementary Fig. 2). Using these primers, no product could be amplified from R. tarandus, H. inermis or C. capreolus. We identified a 3-bp mismatch to the Ovirg_R1 primer in a partial assembly of C. capreolus (GenBank accession: GCA_000751575.1; scaffold: CCMK010226507.1) that is likely at fault. A re-designed reverse primer (Ccap_R1) successfully amplified the adult β-globin gene from the three deer species above as well as C. canadensis (Supplementary Fig. 2). All amplifications were performed using Phusion High-Fidelity PCR Master Mix, with the primers as listed in Supplementary Fig. 2 and 50–100 ng of genomic DNA. The annealing temperature and step timing were chosen according to the manufacturer's guidelines. Amplifications were run for 35 cycles. Gel extractions were performed on samples resolved on 1% agarose gels for 40 min at 90 V using the MinElute Gel Extraction Kit (Qiagen) and following the manufacturer’s protocol. Polymerase chain reaction purifications were performed using the MinElute PCR Purification Kit following the manufacturer’s protocol. All samples were sequenced using the Sanger method with amplification primers and the primer Ovirg_Fmid2.

Transcriptome sequencing and assembly. RNA was extracted from the red cell component of a blood sample of an adult Père David’s deer using the PAXgene Blood RNA Kit. A messenger RNA library was prepared using a TruSeq mRNA Library Prep Kit and sequenced on the MiSeq platform, yielding 25,406,472 paired-end reads of length 150 bp, which were trimmed for adapters and quality-filtered using Trim Galore! (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) with a base quality threshold of 30. The trimmed reads were used as input for de novo transcriptome assembly with Trinity69 using default parameters. A BLASTN homology search against these transcripts, using the O. virginianus adult β-globin coding sequence (CDS) as a query, identified a highly homologous transcript (E-value = 0; no gaps; 97.5% sequence identity compared with 92.2% identity to the foetal β-globin). The CDS of this putative β-globin transcript was 100% identical to the sequence amplified from Père David’s deer genomic DNA (Supplementary Fig. 4). We used emsar70 with default parameters to assess transcript abundances. The three most abundant reconstructed transcripts correspond to full or partial α- and β-globin transcripts, including one transcript, highlighted above, that encompasses the entire adult β-globin CDS. These transcripts are an order of magnitude more abundant than the fourth most abundant (Supplementary Fig. 4), in line with the expected predominance of α- and β-globin transcripts in mature adult red blood cells. To investigate whether the foetal β-globin could be detected in the RNA-Seq data and because amplification of the foetal β-globin from Père David’s deer genomic DNA was not successful, we mapped reads against the foetal β-globin gene of Cervus elaphus elaphus, the closest available relative. Given that the CDSs of the adult β-globins in these species are 100% identical, we expected that the foetal orthologues would likewise be highly conserved. We therefore removed reads with more than one mismatch and assembled putative transcripts from the remaining 1.3 million reads using the Geneious assembler version 10.0.5 (ref. 71) with default parameters (fastest option enabled). We recovered a single contig with high homology to the C. elaphus elaphus foetal β-globin CDS (only a single mismatch across the CDS). We then estimated the relative abundance of adult and the putative foetal transcripts by calculating the proportion of reads that uniquely mapped to either the adult or foetal CDS. A total of 1,820,532 reads mapped uniquely to the adult sequence, whereas 872 mapped uniquely to the foetal CDS, a ratio of 2,088:1.

Structural analysis. Homology models were built for O. virginianus and R. tarandus β-globin sequences using the MODELLER-9v15 programme for comparative protein structure modelling72 using both oxy (1HHO) and deoxy (2HHB) human haemoglobin structures as templates. The structures were used for electrostatic calculations using the Adaptive Poisson–Boltzmann Solver73 plug-in in the Visual Molecular Dynamics programme74. The surface potentials were visualized in Visual Molecular Dynamics with the conventional red and blue colours for negative and positive potential, respectively, set at ± 5 kBT/e.

Modelling of haemoglobin fibres. We first used the programme HADDOCK75 with the standard protein–protein docking protocol to generate ensembles of docking models of β-globin dimers. In each docking run, a different interacting surface centred around a specific residue was defined on each β-globin chain. All residues within 3 Å of the central residue were defined as 'active' and were thus constrained to be directly involved in the interface, while other residues within 8 Å
2), which were isolated by gel extraction and Sanger-sequenced using the of the central residue were defined as 'passive' and were allowed (but not strictly amplification primers. The high molecular weight product had higher nucleotide constrained) to form a part of the interface. We performed docking runs with the identity to the adult (93%) than the foetal (90%) B. taurus β​-globin coding interaction centred between residue 87 and all other residues, generating at least sequence. Note that the discrepancy in size between the adult and foetal β​-globin 100 water-refined β​-globin dimer models for each (although 600 O. virginianus oxy amplicons derives from the presence of two tandem Bov-tA2 short interspersed β​-globin 22V–87Q models were built for use in the interaction energy calculations). nuclear elements in intron 2 of the adult β​-globin gene in cattle, sheep and The β​-globin dimers were then evaluated for their ability to form HbS-like fibres O. virginianus and is therefore likely ancestral. We designed a second set of primers out of full haemoglobin tetramers. Essentially, the contacts from the β​-globin dimer to anneal immediately up- and downstream, and in the middle of the adult models were used to build a chain of five haemoglobin molecules in the same β​-globin gene (Ovirg_F2, Ovirg_R2 and Ovirg_Fmid2; Supplementary Fig. 2). way that the contacts between 6V and the EF pocket lead to an extended fibre in Amplification from DNA extracts of other species with Ovirg_F1/Ovirg_R1 HbS. HbS-like fibres were defined as those in which a direct contact was formed produced mixed results, with some species showing a two-band pattern similar between the first and third haemoglobin tetramers in a chain (analogous to the to O. virginianus and others only a single band corresponding to the putative axial contacts in HbS fibres; see Fig. 2c) and in which the chain is approximately

Nature Ecology & Evolution | VOL 2 | FEBRUARY 2018 | 367–376 | www.nature.com/natecolevol © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

linear. This linearity was measured as the distance between the first and third plus the distance between the third and the fifth haemoglobin tetramers, divided by the distance between the first and fifth. A value of 1 would indicate a perfectly linear fibre, while we considered any chains with a value <1.05 to be approximately linear and HbS-like. Finally, chains containing significant steric clashes between haemoglobin tetramers (defined as >3% of Cα atoms being within 2.8 Å of another Cα atom) were excluded. Fibre formation propensity was then defined as the fraction of all docking models that led to HbS-like fibres.

Interaction energy analysis. Using the 270 22V–87Q models of O. virginianus β-globin dimers that can form HbS-like fibres, we used FoldX (ref. 76) and the ‘RepairPDB’ and ‘BuildModel’ functions to mutate each dimer to the sequences of all other adult deer species. Note that since C. elaphus elaphus, Cervus elaphus bactrianus and E. davidianus have identical amino acid sequences, only one of these was included here. The energy of the interaction was then calculated using the ‘AnalyseComplex’ function of FoldX and then averaged over all docking models. The same protocol was then used for the analysis of the effects of individual mutations, using all possible single amino acid substitutions observed in the adult deer sequences, except that the interaction energy was presented as the change with respect to the wild-type sequence.

Deer species tree and wider mammalian phylogeny. The mammalian phylogeny depicted in Fig. 1 is principally based on the Timetree of Life (ref. 43) with the order Carnivora regrafted to branch above the root of the Chiroptera and Artiodactyla to match the findings in ref. 77. The internal topology of Cervidae was taken from the Cetartiodactyla consensus tree of the 10kTrees project (ref. 61). C. canadensis and C. elaphus bactrianus, not included in the 10kTrees phylogeny, were added as sister branches to C. nippon and C. elaphus elaphus, respectively, following ref. 78. Supplementary Fig. 11 provides a graphical overview of these changes. To generate Fig. 1, we aligned adult deer β-globin coding sequences to a set of non-chimeric mammalian adult β-globin CDSs (ref. 26).

Gene tree reconstruction. All trees were built using RAxML version 8.2.10 based on alignments made with MUSCLE version 3.8.1551. Unless stated otherwise, we used the RAxML joint maximum likelihood and bootstrap analysis (option -f a) with random seeds, a single partition and 100 bootstrap replicates. The GTRGAMMA model was used for nucleotide alignments and the best-fitting protein model was automatically chosen by RAxML using the PROTGAMMAAUTO option.

As some historical alleles (βII, βV and βVII) are only available at the peptide level, Fig. 3c was built at the protein level. In contrast, HBBA (± HBBF) trees (Fig. 3a and Supplementary Figs. 2 and 6) are nucleotide-level trees built from an alignment of coding exons and intervening introns. Note here that intron 2, which is comparatively long and less constrained than coding sequence, contributes a comparatively large number of phylogenetically informative sites. In fact, the intron 2 tree recapitulates the exon + intron tree almost perfectly, with a minor difference in the precise location of C. canadensis in the non-sickling cluster. Note further that a large comparative contribution of intron 2 to the overall phylogenetic signal is fortuitous in this context. To understand patterns of lineage sorting and introgression, it is desirable to eliminate spurious phylogenetic signals introduced by gene conversion, which strongly affects the exonic sequence (as is evident in Fig. 3b), but is much less prevalent in intron 2. Since we explicitly demonstrate (in Supplementary Fig. 6) that including sequence affected by gene conversion does not affect the overall tree topology, we present exon + intron (that is, gene) trees throughout for simplicity.

Topology testing. To test for significant phylogenetic discordance between the HBBA gene tree and the species tree as depicted in Fig. 3a, we compared both topologies using the Approximately Unbiased test (ref. 79) implemented in CONSEL (ref. 80). The unconstrained maximum likelihood HBBA gene tree was tested against an alternative maximum likelihood tree (derived from 200 maximum likelihood starting trees) built under a single constraint to recover the well-established monophyletic groups of Old World and New World deer. Branching patterns within these major clades were allowed to vary. With this approach, we conservatively tested the significance of the incongruent placement of O. virginianus and P. pudu sickling alleles with the Old World deer (and C. canadensis with New World deer) without considering confounding signals from within-clade branching that might arise, for example, due to gene conversion. Both the constrained and unconstrained maximum likelihood trees were calculated with RAxML as described above. Per site log-likelihoods were computed for the unconstrained and constrained maximum likelihood trees with RAxML (option -f G).

Detection of recombination events. We considered two sources of donor sequence for recombination into adult β-globins: adult β-globin orthologues in other deer species and the foetal β-globin paralogue within the same genome. H. inermis HBBF was omitted from this analysis since the sequence of intron 2 was only partially determined. We used the Recombination Detection Program (version 4.83; ref. 81) to test for signals of recombination in an alignment of complete adult and foetal deer β-globin genes that were successfully amplified and sequenced, enabling all subtended detection methods (including primary scans for BootScan and SiScan) except LARD, treating the sequences as linear and listing all detectable events. In humans, conversion tracts of lengths as short as 110 bp have been detected in the globin genes (ref. 82) and tracts as short as 50 bp have been detected in other gene conversion hotspots (refs 83, 84). Given the presence of multiple regions of 100% nucleotide identity across the alignment of adult and foetal deer β-globins (Fig. 3b), we suspected that equally short conversion tracts might also be present. We therefore lowered window and step sizes for all applicable detection methods in the Recombination Detection Program (Supplementary Fig. 8) at the cost of a lower signal-to-noise ratio. As the objective was to test whether recombination events could have generated the phyletic distribution of sickling/non-sickling genotypes observed empirically, this is conservative.

Life Sciences Reporting Summary. Further information on experimental design is available in the Life Sciences Reporting Summary.

Data availability. HBBA and HBBF full gene sequences (coding sequence plus intervening introns) have been submitted to GenBank with accession numbers KY800429–KY800452. An alignment of these sequences is also available as Supplementary Data. Père David’s deer RNA sequencing and white-tailed deer whole-genome sequencing raw data have been submitted to the European Nucleotide Archive with the accession numbers PRJEB20046 and PRJEB20034, respectively.

Received: 26 June 2017; Accepted: 20 November 2017; Published online: 18 December 2017

References
1. Ingram, V. M. Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180, 326–328 (1957).
2. Harrington, D. J., Adachi, K. & Royer, W. E. Jr The high resolution crystal structure of deoxyhemoglobin S. J. Mol. Biol. 272, 398–407 (1997).
3. Wishner, B. C., Ward, K. B., Lattman, E. E. & Love, W. E. Crystal structure of sickle-cell deoxyhemoglobin at 5 Å resolution. J. Mol. Biol. 98, 179–194 (1975).
4. Sears, D. A. The morbidity of sickle cell trait. Am. J. Med. 64, 1021–1036 (1978).
5. Platt, O. S. et al. Mortality in sickle cell disease. Life expectancy and risk factors for early death. N. Engl. J. Med. 330, 1639–1644 (1994).
6. Piel, F. B. et al. Global distribution of the sickle cell gene and geographical confirmation of the malaria hypothesis. Nat. Commun. 1, 104 (2010).
7. Herrick, J. B. Peculiar elongated and sickle-shaped red blood corpuscles in a case of severe anemia. Arch. Int. Med. 5, 517–521 (1910).
8. Gulliver, G. Observations on certain peculiarities of form in the blood corpuscles of the mammiferous animals. Lond. Edinb. Dubl. Philos. Mag. 17, 325–331 (1840).
9. Undritz, E., Betke, K. & Lehmann, H. Sickling phenomenon in deer. Nature 187, 333–334 (1960).
10. Hawkey, C. M. Comparative Mammalian Haematology (Heinemann Educational Books, 1975).
11. Butcher, P. D. & Hawkey, C. M. Haemoglobins and erythrocyte sickling in the artiodactyla: a survey. Comp. Biochem. Physiol. A Physiol. 57, 391–398 (1977).
12. Weber, Y. B. & Giacometti, L. Sickling phenomenon in the erythrocytes of wapiti (Cervus canadensis). J. Mammal. 53, 917–919 (1972).
13. Simpson, C. F. & Taylor, W. J. Ultrastructure of sickled deer erythrocytes. I. The typical crescent and holly leaf forms. Blood 43, 899–906 (1974).
14. Schmidt, W. C. et al. The structure of sickling deer type III hemoglobin by molecular replacement. Acta Cryst. B Struct. Sci. Cryst. Eng. Mat. 33, 335–343 (1977).
15. Pritchard, W. R., Malewitz, T. D. & Kitchen, H. Studies on the mechanism of sickling of deer erythrocytes. Exp. Mol. Pathol. 2, 173–182 (1963).
16. Kitchen, H., Easley, C. W., Putnam, F. W. & Taylor, W. J. Structural comparison of polymorphic hemoglobins of deer with those of sheep and other species. J. Biol. Chem. 243, 1204–1211 (1968).
17. Seiffge, D. Haemorheological studies of the sickle cell phenomenon in European red deer (Cervus elaphus). Blut 47, 85–92 (1983).
18. Kitchen, H., Putnam, F. W. & Taylor, W. J. Hemoglobin polymorphism: its relation to sickling of erythrocytes in white-tailed deer. Science 144, 1237–1239 (1964).
19. Taylor, W. J. & Easley, C. W. Sickling phenomena of deer. Ann. NY Acad. Sci. 241, 594–604 (1974).
20. Harris, M. J., Huisman, T. H. J. & Hayes, F. A. Geographic distribution of hemoglobin variants in the white-tailed deer. J. Mammal. 54, 270–274 (1973).
21. Harris, M. J., Wilson, J. B. & Huisman, T. H. J. Structural studies of hemoglobin α chains from Virginia white-tailed deer. Arch. Biochem. Biophys. 151, 540–548 (1972).

22. Parshall, C. J., Vainisi, S. J., Goldberg, M. F. & Wolf, E. D. In vivo erythrocyte sickling in the Japanese sika deer (Cervus nippon): methodology. Am. J. Vet. Res. 36, 749–752 (1975).
23. Whitten, C. F. Innocuous nature of the sickling (pseudosickling) phenomenon in deer. Br. J. Haematol. 13, 650–655 (1967).
24. Shimizu, K. et al. The primary sequence of the beta chain of Hb type III of the Virginia white-tailed deer (Odocoileus virginianus), a comparison with putative sequences of the beta chains from four additional deer hemoglobins, types II, IV, V, and VIII, and relationships between intermolecular contacts, primary sequence and sickling of deer hemoglobins. Hemoglobin 7, 15–45 (1983).
25. Kitchen, H. & Taylor, W. J. The sickling phenomenon of deer erythrocytes. Adv. Exp. Med. Biol. 28, 325–336 (1972).
26. Gaudry, M. J., Storz, J. F., Butts, G. T., Campbell, K. L. & Hoffmann, F. G. Repeated evolution of chimeric fusion genes in the β-globin gene family of laurasiatherian mammals. Genome Biol. Evol. 6, 1219–1234 (2014).
27. Hardison, R. C. Evolution of hemoglobin and its genes. Cold Spring Harb. Perspect. Med. 2, a011627 (2012).
28. Townes, T. M., Fitzgerald, M. C. & Lingrel, J. B. Triplication of a four-gene set during evolution of the goat beta-globin locus produced three genes now expressed differentially during development. Proc. Natl Acad. Sci. USA 81, 6589–6593 (1984).
29. Schimenti, J. C. & Duncan, C. H. Structure and organization of the bovine beta-globin genes. Mol. Biol. Evol. 2, 514–525 (1985).
30. Craig, J. E., Thein, S. L. & Rochette, J. Fetal hemoglobin levels in adults. Blood Rev. 8, 213–224 (1994).
31. Angeletti, M. et al. Different functional modulation by heterotropic ligands (2,3-diphosphoglycerate and chlorides) of the two haemoglobins from fallow-deer (Dama dama). Eur. J. Biochem. 268, 603–611 (2001).
32. Petruzzelli, R. et al. The primary structure of hemoglobin from reindeer (Rangifer tarandus tarandus) and its functional implications. Biochim. Biophys. Acta Prot. Struct. Mol. Enzymol. 1076, 221–224 (1991).
33. Adachi, K., Reddy, L. R. & Surrey, S. Role of hydrophobicity of phenylalanine beta 85 and leucine beta 88 in the acceptor pocket for valine beta 6 during hemoglobin S polymerization. J. Biol. Chem. 269, 31563–31566 (1994).
34. Nagel, R. L. et al. Beta-chain contact sites in the haemoglobin S polymer. Nature 283, 832–834 (1980).
35. Adachi, K., Konitzer, P. & Surrey, S. Role of gamma 87 Gln in the inhibition of hemoglobin S polymerization by hemoglobin F. J. Biol. Chem. 269, 9562–9567 (1994).
36. Witkowska, H. E. et al. Sickle cell disease in a patient with sickle cell trait and compound heterozygosity for hemoglobin S and hemoglobin Quebec–Chori. N. Engl. J. Med. 325, 1150–1154 (1991).
37. Watson-Williams, E. J., Beale, D., Irvine, D. & Lehmann, H. A new haemoglobin, D Ibadan (beta-87 threonine → lysine), producing no sickle-cell haemoglobin D disease with haemoglobin S. Nature 205, 1273–1276 (1965).
38. Amma, E. L., Sproul, G. D., Wong, S. & Huisman, T. H. J. Mechanism of sickling in deer erythrocytes. Ann. NY Acad. Sci. 241, 605–613 (1974).
39. Girling, R. L., Schmidt, W. C. Jr, Houston, T. E., Amma, E. L. & Huisman, T. H. J. Molecular packing and intermolecular contacts of sickling deer type III hemoglobin. J. Mol. Biol. 131, 417–433 (1979).
40. Fernández, M. H. & Vrba, E. S. A complete estimate of the phylogenetic relationships in Ruminantia: a dated species-level supertree of the extant ruminants. Biol. Rev. 80, 269–302 (2005).
41. Taylor, W. J. & Simpson, C. F. Ultrastructure of sickled deer erythrocytes. II. The matchstick cell. Blood 43, 907–914 (1974).
42. Butcher, P. D. & Hawkey, C. M. in The Comparative Pathology of Zoo Animals (eds Montali, R. J. & Migaki, G.) 633–641 (Smithsonian Institute, Washington, 1980).
43. Hedges, S. B., Marin, J., Suleski, M., Paymer, M. & Kumar, S. Tree of life reveals clock-like speciation and diversification. Mol. Biol. Evol. 32, 835–845 (2015).
44. Wiuf, C., Zhao, K., Innan, H. & Nordborg, M. The probability and chromosomal extent of trans-specific polymorphism. Genetics 168, 2363–2372 (2004).
45. Gao, Z., Przeworski, M. & Sella, G. Footprints of ancient-balanced polymorphisms in genetic variation data from closely related species. Evolution 69, 431–446 (2015).
46. Baker, K. H. et al. Strong population structure in a species manipulated by humans since the Neolithic: the European fallow deer (Dama dama dama). Heredity 119, 16–26 (2017).
47. Ryman, N., Baccus, R., Reuterwall, C. & Smith, M. H. Effective population size, generation interval, and potential loss of genetic variability in game species under different hunting regimes. Oikos 36, 257–266 (1981).
48. Halligan, D. L., Oliver, F., Eyre-Walker, A., Harr, B. & Keightley, P. D. Evidence for pervasive adaptive protein evolution in wild mice. PLoS Genet. 6, e1000825 (2010).
49. Shapiro, J. A. et al. Adaptive genic evolution in the Drosophila genomes. Proc. Natl Acad. Sci. USA 104, 2271–2276 (2007).
50. Koldkjær, P., McDonald, M. D., Prior, I. & Berenbrink, M. Pronounced in vivo hemoglobin polymerization in red blood cells of Gulf toadfish: a general role for hemoglobin aggregation in vertebrate hemoparasite defense? Am. J. Physiol. Regul. Integr. Comp. Physiol. 305, R1190–R1199 (2013).
51. Hawkey, C. M. & Jordan, P. Sickle-cell erythrocytes in the mongoose Herpestes sanguineus. Trans. R. Soc. Trop. Med. Hyg. 61, 180–181 (1967).
52. Butcher, P. D. & Hawkey, C. M. The nature of erythrocyte sickling in sheep. Comp. Biochem. Physiol. A Physiol. 64, 411–418 (1979).
53. Evans, E. T. R. Sickling phenomenon in sheep. Nature 217, 74–75 (1968).
54. Tucker, E. M. Genetic variation in the sheep red blood cell. Biol. Rev. Camb. Philos. Soc. 46, 341–386 (1971).
55. Kijas, J. W. et al. Genome-wide analysis of the world’s sheep breeds reveals high levels of historic mixture and strong recent selection. PLoS Biol. 10, e1001258 (2012).
56. Garcia-Seisdedos, H., Empereur-Mot, C., Elad, N. & Levy, E. D. Proteins evolve on the edge of supramolecular self-assembly. Nature 548, 244–247 (2017).
57. Perry, B. D., Nichols, D. K. & Cullom, E. S. Babesia odocoilei Emerson and Wright, 1970 in white-tailed deer, Odocoileus virginianus (Zimmermann), in Virginia. J. Wildl. Dis. 21, 149–152 (1985).
58. Garnham, P. C. & Kuttler, K. L. A malaria parasite of the white-tailed deer (Odocoileus virginianus) and its relation with known species of Plasmodium in other ungulates. Proc. R. Soc. Lond. B Biol. Sci. 206, 395–402 (1980).
59. Martinsen, E. S. et al. Hidden in plain sight: cryptic and endemic malaria parasites in North American white-tailed deer (Odocoileus virginianus). Sci. Adv. 2, e1501486 (2016).
60. Naidu, A., Fitak, R. R., Munguia Vega, A. & Culver, M. Novel primers for complete mitochondrial cytochrome b gene sequencing in mammals. Mol. Ecol. Resour. 12, 191–196 (2012).
61. Arnold, C., Matthews, L. J. & Nunn, C. L. The 10kTrees website: a new online resource for primate phylogeny. Evol. Anthropol. 19, 114–118 (2010).
62. Paradis, E., Claude, J. & Strimmer, K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289–290 (2004).
63. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
64. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
65. Allen, J. M., Huang, D. I., Cronk, Q. C. & Johnson, K. P. aTRAM—automated target restricted assembly method: a fast method for assembling loci across divergent taxa from next-generation sequencing data. BMC Bioinformatics 16, 98 (2015).
66. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
67. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
68. Treangen, T. J., Sommer, D. D., Angly, F. E., Koren, S. & Pop, M. Next generation sequence assembly with AMOS. Curr. Protoc. Bioinformatics 11, 11.8 (2011).
69. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-Seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
70. Lee, S. et al. EMSAR: estimation of transcript abundance from RNA-Seq data by mappability-based segmentation and reclustering. BMC Bioinformatics 16, 278 (2015).
71. Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–1649 (2012).
72. Eswar, N. et al. Comparative protein structure modeling using MODELLER. Curr. Protoc. Protein Sci. 2, 2.9 (2007).
73. Baker, N. A., Sept, D., Joseph, S., Holst, M. J. & McCammon, J. A. Electrostatics of nanosystems: application to microtubules and the ribosome. Proc. Natl Acad. Sci. USA 98, 10037–10041 (2001).
74. Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).
75. Dominguez, C., Boelens, R. & Bonvin, A. M. HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125, 1731–1737 (2003).
76. Guerois, R., Nielsen, J. E. & Serrano, L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol. 320, 369–387 (2002).
77. Meredith, R. W. et al. Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science 334, 521–524 (2011).
78. Ludt, C. J., Schroeder, W., Rottmann, O. & Kuehn, R. Mitochondrial DNA phylogeography of red deer (Cervus elaphus). Mol. Phylogenet. Evol. 31, 1064–1083 (2004).
79. Shimodaira, H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508 (2002).
80. Shimodaira, H. & Hasegawa, M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17, 1246–1247 (2001).

81. Martin, D. P., Murrell, B., Golden, M., Khoosal, A. & Muhire, B. RDP4: detection and analysis of recombination patterns in virus genomes. Virus Evol. 1, vev003 (2015).
82. Papadakis, M. N. & Patrinos, G. P. Contribution of gene conversion in the evolution of the human beta-like globin gene family. Hum. Genet. 104, 117–125 (1999).
83. Jeffreys, A. J. & May, C. A. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 36, 151–156 (2004).
84. Bosch, E., Hurles, M. E., Navarro, A. & Jobling, M. A. Dynamics of a human interparalog gene conversion hotspot. Genome Res. 14, 835–844 (2004).

Acknowledgements
We thank the Zoological Society of London Whipsnade Zoo (F. Molenaar), Bristol Zoological Society (S. Dow and K. Wyatt), the Royal Zoological Society of Scotland Highland Wildlife Park (J. Morse), the British Deer Society, the Penn State Deer Research Center (D. Wagner) and the Northeast Wildlife DNA Laboratory (N. Chinnici) for samples, the Medical Research Council London Institute of Medical Sciences Genomics Facility for DNA and RNA sequencing, B. N. Sacks, J. Mizzi and T. Brown for access to tule elk sequencing data, P. D. Butcher for discussions, and P. Sarkies, A. Brown and B. Lehner for comments on the manuscript. This work was supported by an Imperial College Interdisciplinary Cross-Campus Studentship to A.E., a Medical Research Council Career Development Award (MR/M02122X/1) to J.A.M., a Leverhulme Trust Fellowship to V.S., and Medical Research Council core funding and an Imperial College Junior Research Fellowship to T.W.

Author contributions
A.E. performed the laboratory experiments and evolutionary analyses and contributed to the experimental design, data analysis and interpretation. L.T.B. and J.A.M. designed and performed the structural modelling and contributed to the data analysis and interpretation. V.S. contributed tissue samples. T.W. conceived the study, contributed to the experimental design, data analysis and interpretation, and wrote the manuscript with input from all authors.

Competing interests
The authors declare no competing financial interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41559-017-0420-3.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to T.W.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
