SCALABLE ALIGNMENT-FREE APPROACHES IN MICROBIAL PHYLOGENOMICS

Guillaume Bernard

Bachelor of Cellular Biology & Physiology and Master in Bioinformatics

A thesis submitted for the degree of Doctor of Philosophy at

The University of Queensland in 2017

Institute for Molecular Bioscience

Abstract

In the 1970s Carl Woese and colleagues discovered the third domain of life by comparing oligonucleotide catalogs of 16S/18S rRNAs. Four decades later, phylogenetic studies are mostly based on multiple sequence alignment (MSA) approaches. However, genome evolution in microbes involves highly dynamic molecular mechanisms including genome rearrangement and lateral genetic transfer (LGT). These mechanisms can potentially violate the implicit assumption of full-length contiguity in MSA. Furthermore, commonly used MSA-based approaches can necessitate the use of heuristic methods, e.g. Bayesian inference, in reconstructing phylogenies, and these may not be scalable to the quantity of existing and forthcoming genome data. In recent years, alignment-free (AF) methods have been developed as an alternative strategy to infer evolutionary relatedness based on shared subsequences of fixed length, known as k-mers, similarly to Woese’s preliminary work. In this thesis, I aimed to study the complex evolution of microbial genomes through the development of novel AF approaches and a systematic assessment of the potential of AF methods for phylogenetic inference. This could provide new insight into microbial evolution and change the way we do phylogenomics, i.e. lead to the development of “next-generation phylogenomics”.

The thesis starts with a brief overview of the diversity of microbial life, and the difficulties in understanding microbial evolution due to complex phenomena such as LGT or rearrangement. I explain how phylogenomic approaches can be used to understand microbial evolution, and describe distinct approaches based on MSA and AF.

The second chapter is a literature review of the conceptual foundations of alignment-free approaches for the inference of phylogenetic relationships of genome sequences. I discuss the limitations of MSA-based approaches, introduce the concept of k-mers, present in detail the different families of alignment-free approaches and describe their applications to infer vertical and lateral phylogenetic signal among microbial genomes.

The three result chapters are presented in the form of research papers, each with its own introduction, methods, results and discussion.

In the first research chapter, I examined the performance of AF approaches in recovering accurate phylogenies of bacterial protein and nucleotide sequences simulated under diverse evolutionary scenarios. I implemented an AF approach to infer phylogenies and compared the robustness of a class of AF methods, the D2 statistics, with an MSA-based approach against among-site rate heterogeneity, compositional biases, genetic rearrangements, insertions/deletions, sequence divergence and sequence truncation. I also assessed the scalability of these methods on simulated and empirical data. This work demonstrated that, compared to an MSA-based approach, AF methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. The AF methods were found to be accurate, scalable and computationally efficient.

In the second research chapter, I systematically assessed the sensitivity and scalability of nine AF methods to genome-scale evolutionary events, including sequence divergence, LGT and rearrangement. The methods selected represent the two families of AF methods, those based on word counts (with exact or inexact k-mers) and those based on match lengths (with or without mismatches). I found that most AF methods are robust against rearrangement and a moderate amount of LGT, and I identified optimal parameters. I also examined the scalability of these methods at genome scale, and found that while remaining fast, their scalability differs between the two families. I also introduced a new application of the jackknife technique to provide node-support values to phylogenies inferred by AF approaches, and showed that these values are biologically meaningful.

In the third results chapter, I implemented an AF approach (based on the D2 statistic) to infer phylogenomic networks for a large dataset of complete genomes of Bacteria and Archaea. I reconstructed a phylogenomic network of microbial life using 2785 completely sequenced bacterial and archaeal genomes, and systematically assessed the impact of ribosomal RNA and plasmid sequences in this network. By implementing and varying a distance threshold, I captured changes in the network structure, e.g. cliques, that reflect the evolutionary dynamics of microbial genomes. I linked the implicated k-mers to annotated genomic regions (thus functions) using a database approach, and defined the term core k-mers. These findings indicate that AF phylogenomics is not limited to tree inference, but can also provide new insight into microbial evolution by combining network analysis and the use of a relational k-mer database in a scalable manner.

Declaration by author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my research higher degree candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis.

Publications during candidature

G. Bernard, P. Greenfield, M. A. Ragan and C. X. Chan. K-mer similarity, networks of microbial genomes and taxonomic rank. bioRxiv 125237, DOI: 10.1101/125237, 2017.

G. Bernard, M. A. Ragan and C. X. Chan. Recapitulating phylogenies using k-mers: from trees to networks. F1000Research 5, 2789, DOI: 10.1038/srep28970, 2016.

G. Bernard, C. X. Chan and M. A. Ragan. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer, Scientific Reports, 6:28970, DOI: 10.1038/srep28970, 2016.

C. X. Chan, G. Bernard, O. Poirion, J. M. Hogan, and M. A. Ragan. Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports, 4:6504, DOI: 10.1038/srep06504, 2014.

M. A. Ragan, G. Bernard and C. X. Chan. Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra. RNA Biology 11(3):176-185, DOI: 10.4161/rna.27505, 2014.

Publications included in this thesis

M. A. Ragan, G. Bernard and C. X. Chan. Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra. RNA Biology 11(3):176-185, 2014 - incorporated as part of Chapter 3.

Contributor: Statement of contribution

Author Guillaume Bernard (Candidate): Designed experiments (80%); Wrote and edited paper (35%); Figures and tables (100%)

Author Cheong Xin Chan: Designed experiments (10%); Wrote and edited paper (10%)

Author Mark A. Ragan: Designed experiments (20%); Wrote the paper (55%)

C. X. Chan, G. Bernard, O. Poirion, J. M. Hogan, and M. A. Ragan. Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports, 4:6504, 2014 - incorporated as part of Chapter 3.

Contributor: Statement of contribution

Author Guillaume Bernard (Candidate): Designed experiments (60%); Wrote and edited paper (30%); Figures and tables (50%)

Author Cheong Xin Chan: Designed experiments (20%); Wrote the paper (55%); Figures and tables (50%)

Author Mark A. Ragan: Designed experiments (20%); Wrote and edited paper (10%)

Author James M. Hogan: Designed experiments (5%); Wrote and edited paper (5%)

Author Olivier Poirion: Designed experiments (5%)

G. Bernard, C. X. Chan and M. A. Ragan. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer, Scientific Reports, 6:28970, 2016 - incorporated as Chapter 4.

Contributor: Statement of contribution

Author Guillaume Bernard (Candidate): Designed experiments (80%); Wrote the paper (60%); Figures and tables (100%)

Author Cheong Xin Chan: Designed experiments (10%); Wrote and edited paper (20%)

Author Mark A. Ragan: Designed experiments (10%); Wrote and edited paper (20%)

G. Bernard, C. X. Chan and M. A. Ragan. Recapitulating phylogenies using k-mers: from trees to networks. F1000Research 5, 2789, 2016 - incorporated as part of Chapter 5.

Contributor: Statement of contribution

Author Guillaume Bernard (Candidate): Designed experiments (100%); Wrote the paper (60%); Figures and tables (100%)

Author Cheong Xin Chan: Wrote and edited paper (30%)

Author Mark A. Ragan: Wrote and edited paper (10%)

G. Bernard, P. Greenfield, M. A. Ragan and C. X. Chan. K-mer similarity, networks of microbial genomes and taxonomic rank. bioRxiv 125237, 2017 - incorporated as part of Chapter 5.

Contributor: Statement of contribution

Author Guillaume Bernard (Candidate): Designed experiments (70%); Wrote the paper (50%); Figures and tables (100%)

Author Cheong Xin Chan: Wrote and edited paper (20%)

Author Mark A. Ragan: Wrote and edited paper (20%)

Author Paul Greenfield: Designed experiments (30%); Wrote and edited paper (10%)

Contributions by others to the thesis

No contributions by others.

Statement of parts of the thesis submitted to qualify for the award of another degree

None.

Acknowledgements

First and foremost, I would like to thank my Principal Supervisor Prof. Mark A. Ragan for giving me the opportunity to do my PhD under his supervision; I could not have wished for a better supervisor. Despite your busy schedule you were always available to talk about my project. You are the most brilliant mind that I know and it was an honor to be your student.

C. X. Chan.

Thank you to everyone in the Ragan Group, particularly Lana, for everything you have done for me over the past four years. Huanle, Tim and Raul have been the best company during this PhD. Thank you to everyone at IMB for the great working environment and for giving me the chance to meet so many great scientists. Huge thanks to Amanda, Cody and Olga for everything they did for me; you are the best.

Many thanks to everyone at the UQ karate club; I couldn’t have done this without you all. I would like to thank Sensei Sinn for everything he taught me. I will continue my karate journey with your precious lessons in mind.

I would also like to thank my housemates at 61 Rosecliffe Street: Christina, Kevin, Rob, Alex, Gerwin, Francesco, Micha and Rachel. We had a great time in "our home". I miss you, but I know that we will see each other very soon. And another thanks to Alex and Micha for being there for me when I needed it the most; it meant a lot.

Thanks to all the amazing people I met in Australia, who contributed to these awesome four years. You know who you are and I will always remember you. And of course, many heartfelt thanks to Clemens, Jasmin, Katharina and Morana for always being there for me and for all the wonderful moments we had.

I am very grateful to my dear friend Atefeh for always being by my side throughout this journey.

Finally, I would like to thank my grandparents, my mother, my father, my brothers and my friends, who have always supported me from France or elsewhere, despite the distance and the time difference. I love you all.

Keywords

Alignment-free, phylogenomics, evolutionary biology, phylogeny, k-mers, network, lateral genetic transfer, microbes, bacteria, archaea

Australian and New Zealand Standard Research Classifications (ANZSRC)

ANZSRC code: 060309, Phylogeny and Comparative Analysis, 50%

ANZSRC code: 060408, Genomics, 30%

ANZSRC code: 060102, Bioinformatics, 20%

Fields of Research (FoR) Classification

FoR code: 0603, Evolutionary Biology, 50%

FoR code: 0604, Genetics, 30%

FoR code: 0601, Biochemistry and Cell Biology, 20%

Table of contents

Abstract
Table of contents
List of Figures and Tables

Chapter 1: Introduction
  1.1 Microbial evolution
  1.2 Classical phylogenetic approaches
    1.2.1 Phylogenetic inference & multiple sequence alignment
    1.2.2 Phylogenetic trees
    1.2.3 Phylogenetic networks
  1.3 Alignment-free phylogenetic approaches
  1.4 Research problem
  1.5 Aims
  1.6 Thesis outline
  1.7 References

Chapter 2: Alignment-free inference of hierarchical and reticulate phylogenomic relationships
  2.1 Alignment-free inference of hierarchical and reticulate phylogenomic relationships
    2.1.1 Abstract
    2.1.2 Introduction
    2.1.3 Critique of classical (alignment-based) phylogenetics
    2.1.4 Alignment-free methods and k-mers
    2.1.5 Phylogenetic inference based on k-mers
    2.1.6 Alignment-free approaches to lateral genetic transfer
    2.1.7 Conclusion
    2.1.8 References

Chapter 3: Phylogenetic inference of molecular sequences using alignment-free approaches
  3.1 Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra
    3.1.1 Abstract
    3.1.2 Introduction
    3.1.3 Results
    3.1.4 Discussion
    3.1.5 Materials and Methods
    3.1.6 References
  3.2 Inferring phylogenies of evolving sequences without multiple sequence alignment
    3.2.1 Abstract
    3.2.2 Introduction
    3.2.3 Results
    3.2.4 Discussion
    3.2.5 Methods
      Simulated sequence data
    3.2.6 References
  3.3 Concluding remarks

Chapter 4: Can alignment-free approaches accurately be used to infer phylogenies of microbial genomes?
  4.1 Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer
    4.1.1 Abstract
    4.1.2 Introduction
    4.1.3 Results
    4.1.4 Discussion
    4.1.5 Methods
    4.1.6 References
  4.2 Concluding remarks

Chapter 5: Alignment-free networks: the next microbial phylogenomics
  5.1 Recapitulating phylogenies using k-mers: from trees to networks
    5.1.1 Abstract
    5.1.2 Introduction
    5.1.3 Methods
    5.1.4 Results and discussion
    5.1.5 References
  5.2 K-mer similarity, networks of microbial genomes and taxonomic rank
    5.2.1 Abstract
    5.2.2 Introduction
    5.2.3 Results
    5.2.4 Discussion
    5.2.5 Methods
    5.2.6 References
  5.3 Concluding remarks

Chapter 6: General discussion
  6.1 Phylogenetic inference of evolving sequences without multiple sequence alignment
  6.2 Microbial phylogenomics using alignment-free approaches
  6.3 An extended AF approach to study microbial evolution
  6.4 Conclusion and future directions
  6.5 References

Appendix A: Supplementary document for Chapter Three

Appendix B: Supplementary document for Chapter Four

Appendix C: Supplementary document for Chapter Five

List of Figures and Tables

All figures and tables included within this thesis are contained within published or submitted manuscripts (Chapters Two, Three, Four and Five) and are numbered and referenced therein.

CHAPTER 1: INTRODUCTION

Sequence analysis is one of the most important disciplines in computational biology, driven by the huge amount of data generated by molecular biology and genomics1. In phylogenetics, sequence analysis takes place in an evolutionary context. The classical phylogenetic approaches based on multiple sequence alignment (MSA) are too slow to apply at large scales, and scalable alignment-free (AF) methods are desirable to get the most out of this huge amount of data. Moreover, because microbial evolution does not follow a strict tree-like model as often depicted, a different representation, such as a network, might be more appropriate. Approaches based on k-mer comparison offer a promising alternative, but their performance and computational scalability in phylogenetic inference are poorly understood and have not been systematically investigated.

In this research, I studied AF approaches to the inference of phylogenomic relationships and proposed an extended approach to infer similarity networks of microbial genomes. The AF approaches are robust against molecular events known to be rampant in microbial genomes, and scalable to large-scale data without loss of accuracy. The extended approach presented here addresses some of the limitations of the current classical approaches and provides an overview of microbial evolution using a large dataset of microbial genomes.

1.1 Microbial evolution

Microorganisms (or microbes) are the most abundant living organisms on the planet, across a wide range of environments. They are well-known for their dynamic evolution leading to their immense genetic and physiological diversity. Microbes include bacteria, archaea and protozoa as well as fungi, algae and some micro-animals. The extremely dynamic evolutionary regime of the microbes has permitted them to survive and adapt to their changing environments.

Microbes transmit genetic information not only vertically (from parent to offspring) but also laterally, from one organism to another2-4. Lateral genetic transfer (LGT), a phenomenon of non-lineal exchange of genetic material between two organisms, is central to the microbial way of life: up to 40% of gene families bear evidence of lateral transfer5. Its best-known manifestation is in the spread of antibiotic resistance, surface adhesion, host range and antigenic properties creating “superbugs”, and in the diversity of core physiologies.

The large diversity of evolutionary outcomes related to genetic transfer is the result of a complex interplay between ecological and molecular factors. Ecological factors are those related to the selection and fixation of the transferred deoxyribonucleic acid (DNA), such as the interactions between organisms (e.g. competition, symbiosis), environmental conditions (e.g. physico-chemical conditions, carbon substrate availability), population size and intra-genetic diversity. Molecular factors, on the other hand, encompass processes that are directly related to the transfer and incorporation of DNA: (a) the mechanisms of transfer, (b) the mechanisms of incorporation and (c) the defence mechanisms of the host against foreign DNA. These are prevailing mechanisms that underlie LGT and environmental adaptation in microbes.

The key mechanisms responsible for the exchange of genetic material in microbes are conjugation, transduction and transformation. Microbial conjugation results in the transfer of genetic material (a plasmid) between two cells by direct cell-to-cell contact or by a bridge-like connection6,7; these plasmids may or may not be stably incorporated into the main microbial genome8. Transduction is the mechanism by which DNA is transferred from one bacterium to another by a bacteriophage9. Transformation is the genetic alteration of a cell resulting from the direct uptake and incorporation of exogenous genetic material (i.e. DNA) from its surroundings and through the membrane10.

LGT is not the only phenomenon impacting or driving microbial evolution. Microbial genomes are also subject to structural rearrangements including insertions, deletions, duplications, inversions or translocations11. An insertion is the addition of one or more nucleotide base pairs into a DNA sequence. Insertions can range in size from one base pair incorrectly inserted into a DNA sequence to a section of one chromosome inserted into another. A deletion, on the other hand, is the loss of genetic material. Any number of nucleotides can be deleted, from a single base to an entire piece of a chromosome. A gene duplication event can be defined as any duplication of a region of DNA that contains a gene. Finally, an inversion is a rearrangement in which an internal segment of a chromosome has been reversed, and a translocation is a rearrangement in which acentric fragments of two non-homologous chromosomes exchange positions.

In combination, all these mechanisms contribute to the complex dynamics of microbial genomes.

1.2 Classical phylogenetic approaches

The term phylogenetics derives from the Greek terms phyle (tribe) and genetikos (relative to birth). Thus, phylogenetics is the study of evolutionary relatedness among a set of taxa. Phylogenetic analysis highlights the degree of relationship and evolution among taxa (e.g. among species) and groups of taxa within a population.

The pattern of inheritance is encoded in DNA. The degree of similarity among these sequences (and their encoded proteins) can thus be used as a measure of evolutionary relatedness12.

1.2.1 Phylogenetic inference & multiple sequence alignment

The traditional phylogenetic approaches can be described in three steps. First, a set of homologous sequences (i.e. the same gene across different taxa) is assembled. Second, the sequences (three or more) are aligned using MSA under the hypothesis of position-by-position homology, i.e. the sequences are arranged according to the similar (e.g. conserved) regions they have in common. Third, phylogenetic relationships (typically a tree) are inferred based on the MSA.

Depending on the chosen methods, phylogenies can be more or less accurate, and can require anywhere from a few seconds to weeks of computation. The purpose of an MSA algorithm is to produce an alignment that reflects the biological relationship between several sequences. Most of these algorithms are based on the iterative progressive algorithm13. This greedy heuristic assembly algorithm is based on a simple principle: the sequences are incorporated one-by-one into the MSA with a pairwise alignment algorithm while following a guide tree (a rooted binary tree built from the unaligned sequences). The algorithm repeats the loop (i.e. incorporates a new sequence), and the MSA and guide tree are re-estimated until convergence.

The most important part of the pairwise alignment algorithm is the scoring scheme it follows. These scoring schemes can be separated into two categories: (a) matrix-based and (b) consistency-based. Matrix-based algorithms such as ClustalW14, MUSCLE15 or Kalign16 are based on a substitution matrix used to estimate the cost of a match between two symbols or profiled columns. However, this score depends only on the considered columns or their immediate neighbours. By contrast, consistency-based schemes integrate more information into the evaluation. These algorithms, which are a mix of the T-Coffee principle17 and the overlapping weights used by Dialign18, compile a collection of pairwise global and local alignments and use them as a position-specific substitution matrix during progressive alignment. Most studies indicate that consistency-based methods are more accurate than their matrix-based counterparts19, but their computational time is N times greater than that of simpler methods (where N is the number of sequences).
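To make the notion of a scoring scheme concrete, the following is a minimal sketch of global pairwise alignment scoring (the Needleman-Wunsch recurrence) in Python. It is an illustration only, not the algorithm implemented by any of the tools cited above: the function name, the flat match/mismatch scores standing in for a substitution matrix, and the linear gap penalty are all assumptions chosen for brevity.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gaps).

    The match/mismatch values are a stand-in for a real substitution
    matrix; all three scores are illustrative assumptions.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

With these assumed scores, aligning "ACGT" against "AGT" gives three matches and one gap, i.e. a score of 1. Each cell depends only on the pair of symbols being compared, which is the limitation of matrix-based schemes noted above.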

The approaches most commonly used to infer phylogenies based on MSA are those based on maximum parsimony20,21, maximum likelihood22, Bayesian inference23 and clustering algorithms24-26.

The aim of the approaches based on maximum parsimony is to find the tree topology that minimises the cost of changes for a given alignment. This approach is based on the assumption that a change requiring the minimum number of steps (i.e. the most parsimonious solution), from phenotype A to B, is the most plausible explanation. Because the number of possible topologies can be huge, the tree-searching part of this approach is heuristic.
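The cost being minimised can be illustrated for a single, fixed topology. The sketch below uses Fitch's small-parsimony algorithm, which is one standard way of counting the minimum number of changes on a given rooted binary tree under equal costs; it is a named substitute for illustration and not necessarily the procedure used in the cited works. The nested-tuple tree encoding and the function names are assumptions.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with leaf names (str) at the tips (assumed encoding).
    states: dict mapping each leaf name to its observed state.
    """
    changes = 0

    def post_order(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is the observation
            return {states[node]}
        left, right = (post_order(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed here
            return common
        changes += 1                       # children disagree: one change required
        return left | right

    post_order(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum of Fitch scores over all columns of an alignment (dict name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
               for i in range(n_sites))
```

For the tree (("A", "B"), ("C", "D")) and the two-column alignment A=GA, B=GA, C=TA, D=TC, each column requires one change, so the tree length is 2. A heuristic tree search would compare this length across many candidate topologies.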

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to identify the tree on which the data are most likely, given a model of sequence change. A substitution model is required to assess the probability of particular mutations; approximately, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as being a less likely fit to the data. The method also requires that evolution at different sites and along different lineages be statistically independent. Because it formally requires a search of all possible combinations of tree topology and branch length, it is computationally expensive to perform on more than a few sequences27. Due to this computational demand, maximum likelihood is often implemented with approximations, but only the tree-searching component is truly heuristic.

The Bayesian methods differ from the maximum likelihood methods as they give a posterior probability of the tree, according to the data, a prior, and a model. Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, which are beyond the scope of this project and not described here.

Finally, the approaches based on clustering algorithms can give a reasonably accurate phylogeny from a given distance matrix while being faster than the methods described above. For example, the principle of the neighbour-joining algorithm is to find, from a given distance matrix, the pair of operational taxonomic units (OTUs; initially the sequences themselves) that minimises the sum of the branch lengths of an intermediate tree. The pair of OTUs is then considered as a new OTU. Finally, the distance matrix is recomputed considering the new OTU. After the construction of the topology, neither the MSA nor the distance matrix is used any longer.
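One iteration of neighbour-joining can be sketched as follows. The sketch assumes the commonly used Q-criterion formulation for choosing the pair to join and the usual averaging rule for the distances to the new OTU; the function names and the nested-dictionary distance matrix are illustrative choices, and branch-length estimation is omitted.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of OTUs that minimises the neighbour-joining Q criterion.

    dist is a symmetric nested dict: dist[a][b] == dist[b][a] (assumed layout).
    """
    n = len(names)
    totals = {a: sum(dist[a][b] for b in names if b != a) for a in names}

    def q(pair):
        a, b = pair
        return (n - 2) * dist[a][b] - totals[a] - totals[b]

    return min(combinations(names, 2), key=q)


def nj_merge(names, dist, a, b, new):
    """Replace OTUs a and b by a single new OTU and recompute the distance matrix."""
    rest = [x for x in names if x not in (a, b)]
    out = {x: {y: dist[x][y] for y in rest if y != x} for x in rest}
    out[new] = {}
    for x in rest:
        d = 0.5 * (dist[a][x] + dist[b][x] - dist[a][b])
        out[new][x] = d
        out[x][new] = d
    return rest + [new], out
```

Calling nj_pick_pair and nj_merge repeatedly until three OTUs remain yields the unrooted topology; a full implementation would also record the branch lengths at each merge.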

1.2.2 Phylogenetic trees

Phylogenetic relationships are commonly depicted as a tree (Figure 1). A phylogenetic tree, which can be rooted or unrooted, is composed of internal and external nodes connected by edges (branches); the external nodes (leaves) represent the taxa.

A tree represents the relationships in a group of taxa (here, in the form of DNA or protein sequences) and the key assumption is that all taxa within this tree share a common ancestor (i.e. they are homologous). Based on the tree topology only (i.e. without consideration of branch lengths), the fewer nodes and edges that separate two taxa, the more closely related the two sequences are perceived to be. An internal node represents an inferred common ancestor of all the taxa that descend from it. The branch length characterises the evolutionary divergence between the two linked nodes, and is usually represented as the inferred number of substitutions per site. The root is the position on the tree of the common ancestor for all the taxa. It is important to keep in mind that the tree is always a hypothesis and, except when based on simulated data or serial transfer experiments28, cannot be directly verified because it represents past evolutionary events.

Figure 1. A current view of the tree of life, encompassing the total diversity represented by sequenced genomes29.

The different regions of DNA, the different genes and their coded proteins, do not accept mutations at the same rate because of the various evolutionary constraints on different genes and non-coding DNA regions, including cis-regulatory motifs. For instance, regions of ribosomal RNA (16S or 23S for bacteria, 18S or 28S for eukaryotes) are more highly conserved than genes coding for metabolic functions30,31. For a given protein, however, the evolutionary rate can be modelled as constant across evolutionary time and among the different lineages.

1.2.3 Phylogenetic networks

Since Darwin’s work32, the tree model has been widely used to describe the evolution of organisms, but common events in microbial genomes, such as LGT or rearrangement, are not tree-like in nature (Figure 2). In recent years, the tree structure has been shown not to represent microbial evolution adequately, and a network structure is perceived to be more realistic33,34.

Figure 2. A phylogenetic tree, reticulated by lateral branches to form a network35.

Networks have been used to describe patterns among biological organisms since the mid-1700s36 and started to be applied to represent molecular phylogenetic signal in the 1990s37-39. In a phylogenetic network, each node represents an entity (e.g. a gene, a genome or a mobile genetic element) and edges represent evidence of shared genetic material between two nodes.

Different types of phylogenetic networks can be used in a phylogenetic analysis. A phylogenetic tree can be considered as a strict form of network in which only vertical inheritance signal is present. Similarly, if only lateral transmission signal is captured the network is an LGT network40. If the entities in the network are genes and the edges represent percent similarity between the sequences, it is called a sequence-similarity network41,42. If the nodes represent genomes and the edges represent evidence of shared genes, it is called a genome network43,44.

In genome networks, not all entities are necessarily genealogically related, allowing for simultaneous analysis of mobile genetic elements, e.g. plasmids or viruses, and cellular evolution. In that respect, a network can be more inclusive than a tree, which is restricted to one type of relationship between a single fraction of the biological diversity45. Usually a genome network does not show the genes contributing to the connection between two nodes, but multiplex graphs have been developed to display the identity of the genes associated with each edge46. However, the visualisation of these networks can be difficult and other graphs, such as the “gene families-genomes” bipartite graphs, have been recently developed to analyse gene transmission34.
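As a minimal sketch of the genome-network idea, the Python function below connects two genomes whenever they share at least a given number of gene families, and keeps the shared families on each edge in the spirit of the multiplex graphs mentioned above. The input layout (a dictionary of gene-family sets), the function name and the min_shared threshold are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def genome_network(gene_content, min_shared=1):
    """Build an undirected genome network from gene-family content.

    gene_content: dict mapping a genome (or plasmid, virus) name to the set of
                  gene families it carries (assumed input format).
    Returns a dict mapping each connected pair to the set of shared families.
    """
    edges = {}
    for a, b in combinations(sorted(gene_content), 2):
        shared = gene_content[a] & gene_content[b]
        if len(shared) >= min_shared:
            edges[(a, b)] = shared
    return edges
```

Because nodes are only required to share genes, a plasmid and a chromosome can appear in the same network even though they are not genealogically related.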

1.3 Alignment-free phylogenetic approaches

So-called alignment-free (AF) approaches, an alternative to MSA, are among the more recent tools being developed in computational biology. Most AF approaches are based on methods that involve (a) the decomposition of sequences into subsequences (known as words, sub-words or k-mers), and (b) the comparison of sets of these fragments using different formulae to derive pairwise distances between the sequences. These distances can be treated as evolutionary distances, which can then be used for phylogenetic inference.
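The two steps can be sketched in a few lines of Python. The decomposition into overlapping k-mers follows the description above; the distance used here (one minus the Jaccard index of the two k-mer sets) is a deliberately simple stand-in chosen for illustration and is not one of the methods evaluated in this thesis. Function names are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Step (a): count every overlapping subsequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_distance(x, y, k):
    """Step (b), illustrative only: 1 - Jaccard index of the two k-mer sets."""
    a, b = set(kmer_counts(x, k)), set(kmer_counts(y, k))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

For example, kmer_counts("GATTACA", 3) yields the five 3-mers GAT, ATT, TTA, TAC and ACA, each with count 1. Computing the distance for every pair of sequences gives a matrix that can be passed to a distance-based tree-building method.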

Following Haubold47, these methods can be classified broadly into those based on word (k-mer) count, and those based on match length. Certain other AF methods may fit partially into these classes, or do not fit in any of them48. In this research, I focused on nine different AF methods representing the two major classes.

The first set of methods is based on word count, and compares k-mer profiles to infer phylogenetic distances. The $D_2$ statistics49 generate a score for each possible pair of sequences within a set based on k-mer counts. These scores can be transformed via logarithmic representation of the geometric mean to generate a distance. The feature frequency profiles (ffp) method50-52 builds the k-mer frequency profile for each sequence, and then uses the Jensen-Shannon divergence50 to compare their profiles and generate a distance between the sequences. The composition vector method shares the same principle as ffp, but the k-mer frequencies are divided by the frequencies expected by chance alone53. The co-phylog method counts the proportion of shared words that differ in the middle position as a distance between two sequences, i.e. a mutation point surrounded by a context of certain length present in both sequences54. Finally, the spaced-word frequencies method uses sets of match and mismatch position patterns to compare the spaced-word frequencies between two sequences. The patterns can vary in length and number55,56.
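To make the word-count idea concrete, the following is a minimal sketch in Python of two of the measures described above: the basic $D_2$ score with a log-transformed distance, and a Jensen-Shannon divergence between k-mer frequency profiles. The function names are hypothetical, and the exact distance transform (absolute log of the pair score over the geometric mean of the two self-scores) is an assumption made for illustration; the published implementations may normalise differently. Sequences are assumed to be at least k characters long.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers (stride 1) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(x, y, k):
    """Basic D2 score: sum over all words of the product of their counts."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items())


def d2_distance(x, y, k):
    """Assumed transform: |ln(D2(x, y) / sqrt(D2(x, x) * D2(y, y)))|."""
    pair = d2_score(x, y, k)
    if pair == 0:
        return math.inf  # no shared k-mers at this k
    geo_mean = math.sqrt(d2_score(x, x, k) * d2_score(y, y, k))
    return abs(math.log(pair / geo_mean))


def js_divergence(x, y, k):
    """Jensen-Shannon divergence (base 2) between two k-mer frequency profiles."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nx, ny = sum(cx.values()), sum(cy.values())
    total = 0.0
    for w in set(cx) | set(cy):
        p, q = cx[w] / nx, cy[w] / ny
        m = (p + q) / 2  # mixture profile
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Both functions return 0 for identical sequences, and larger values as the k-mer profiles diverge. A matrix of such pairwise values could then be passed to a distance-based tree-building method such as neighbour-joining.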

The second set of methods is based on match length: instead of counting words of fixed length, these methods use the lengths of exact (or inexact) matches between pairs of sequences. The grammar-based method uses a set of rules to decompose a string into substrings, e.g. GGTTTAA decomposed as G2T3A2. The idea is that closely related sequences are more compressible than divergent sequences57. Different compression schemes are available, such as the Lempel-Ziv factorisation47. The average common substring (acs) method58 uses the concept of matching statistics59. Instead of decomposing the concatenation of two sequences, this method searches for the longest match in sequence A starting at every position in sequence B. Unlike the Lempel-Ziv factorisation method, here the longest matches can overlap. The shortest unique substring method, instead of looking for the longest matches between two sequences, looks for the longest common substrings extended by one, known as the longest substrings between two mutations, i.e. the SHortest Unique subSTRING, shustring60. Finally, the k-mismatch average common substring method is a variant of the acs method, using the longest common substring with k mismatches56,61.
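The match-length idea can be sketched in Python as follows. This is a naive illustration, not the published acs implementation: the function names are hypothetical, the search uses plain substring tests rather than the suffix-tree matching statistics cited above, and the distance formula (log of the target length over the mean match length, minus a self-comparison correction, then averaged over both directions) is an assumption based on the usual description of the method. Both sequences are assumed to be non-empty.

```python
import math


def match_lengths(query, target):
    """For each start position in query, the length of the longest
    substring beginning there that occurs anywhere in target."""
    lengths = []
    for i in range(len(query)):
        n = 0
        # Extend the match one character at a time while it still occurs.
        while i + n < len(query) and query[i:i + n + 1] in target:
            n += 1
        lengths.append(n)
    return lengths


def acs_distance(a, b):
    """Symmetrised distance from average longest-match lengths (assumed form)."""
    def directed(q, t):
        mean_len = sum(match_lengths(q, t)) / len(q)
        if mean_len == 0:
            return math.inf  # no character of q occurs in t
        # The second term approximately cancels the distance of q to itself.
        return math.log(len(t)) / mean_len - 2 * math.log(len(q)) / len(q)

    return (directed(a, b) + directed(b, a)) / 2
```

Because matches starting at neighbouring positions are measured independently, they can overlap, which is the contrast with Lempel-Ziv factorisation noted above. The naive search is far slower than an index-based one and is only suitable for short sequences.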

In Chapter 2 I present a comprehensive review of applications of AF approaches in phylogenomics, and in Chapters 3 and 4 I describe in greater detail the nine AF methods used in this research.

1.4 Research problem

For decades phylogenetic approaches based on MSA have been the standard approach to infer phylogenies. But MSA is not without limitations. MSA, based on heuristic algorithms62, provides alignment scores for which relevance to homology can be difficult to assess statistically63. Moreover, these approaches can be computationally intensive, particularly on large datasets (i.e. at genome scale). MSA is also complicated by local structural variation (such as genetic recombination, duplication, gene gain and loss), which yields regions that are aligned ambiguously or not at all (Figure 3).


Figure 3. Simplified workflow of phylogenomic approaches64. A simplified workflow of (A) the classical and (B) an alternative phylogenetic approach, given a set of homologous sequences and a reference phylogeny shown on top. The difference between the two resulting phylogenetic trees is highlighted in red.

To overcome problems encountered by the classical phylogenetic approaches, AF methods have started to be used in phylogenetic analysis. By decoupling homology signal from full-length sequence contiguity, these methods avoid the computational complexity of MSA while capturing signal otherwise lost to insertions/deletions, recombination, or shuffling (Figure 3). Compared to MSA-based phylogenetic approaches, AF phylogenetic approaches are still in their infancy and their performance is little known. Studies have shown that these methods can be used at gene or genome scale and that they are potentially scalable, but no studies have statistically assessed their scalability. In the same way, the sensitivity of these AF approaches to different biological scenarios remains to be systematically investigated.

Most of the AF approaches have been developed to infer phylogenetic trees, but as I have discussed previously, microbial evolution does not follow a simple tree-like structure. In order to infer biologically more-realistic phylogenies of microbial genomes it is necessary to expand the applications of AF approaches to phylogenetic inference beyond trees.

1.5 Aims

1. To examine the robustness of alignment-free approaches to biological scenarios at gene level.

2. To examine the sensitivity of alignment-free approaches to evolutionary scenarios at genome level.

3. To assess the scalability of alignment-free approaches.

4. To improve phylogenomic inference using an extended AF approach.

1.6 Thesis outline

The second chapter is a literature review of the principles and applications of AF approaches in phylogenomic inference using genome sequences. I start by presenting the difficulties encountered by classical phylogenomic approaches based on MSA. Then I describe the alternative AF approaches, focusing on k-mer count methods, and their application for phylogenetic inference. Finally, I review the application of AF approaches to detect LGT.

The following three chapters describe the steps I undertook to study the performance of AF approaches in inferring phylogenetic relationships and to develop new AF approaches. First, I focused on the robustness and scalability of AF approaches under different biological scenarios at gene level, using both simulated and empirical data. I compared the performance of MSA-based and AF approaches against among-site rate heterogeneity, compositional biases, genetic rearrangements, insertions/deletions, sequence divergence and sequence truncation. I used three AF methods based on the $D_2$ statistics, and developed another one using k-mer neighbourhood comparison. The results of the comparison between MSA-based and AF approaches are presented as an article in Chapter 3.

Second, I systematically assessed the sensitivity of nine AF methods to genome-scale evolutionary events including divergence, LGT and rearrangement. These methods represent the two AF families, those based on word counts and those based on match lengths. I also examined and compared the scalability of these methods at genome scale. Finally, I introduced a new application of the jackknife technique to provide node-support values to phylogenies inferred by AF approaches. A review of the AF methods used in this study and my findings are presented as an article in Chapter 4.

Third, I implemented an AF approach to infer phylogenomic networks of microbial genomes. I applied this approach to a large dataset of microbial genomes, and systematically assessed the impact of ribosomal RNAs and plasmid sequences in this network. I implemented a distance threshold that was used to capture changes in the network structure (cliques) that reflect the evolutionary dynamics of microbial genomes. Finally, I linked the implicated k-mers to annotated genomic regions (e.g. gene functions) using a database approach, and defined the term core k-mers. This work is presented as an article in Chapter 5.

I conclude the thesis with high-level discussions of my work and its impact on the research field, limitations of the existing scope of study, and perspectives for future research.

1.7 References

1 Gross, M. Riding the wave of biological data. Curr Biol 21, R204-206, doi:10.1016/j.cub.2011.03.009 (2011).

2 Lawrence, J. G. & Ochman, H. Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44, 383-397 (1997).

3 Lawrence, J. G. & Ochman, H. Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95, 9413-9417 (1998).

4 Nelson, K. E. et al. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399, 323-329, doi:10.1038/20601 (1999).

5 Ragan, M. A. Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev 11, 620-626 (2001).

6 Holmes, R. K. & Jobling, M. G. in Medical Microbiology (ed S. Baron) (1996).

7 Vulic, M., Lenski, R. E. & Radman, M. Mutation, recombination, and incipient speciation of bacteria in the laboratory. Proc Natl Acad Sci U S A 96, 7348-7351 (1999).

8 Skippington, E. & Ragan, M. A. Lateral genetic transfer and the construction of genetic exchange communities. FEMS Microbiol Rev 35, 707-735, doi:10.1111/j.1574-6976.2010.00261.x (2011).

9 Evans, T. J. et al. Characterization of a broad-host-range flagellum-dependent phage that mediates high-efficiency generalized transduction in, and between, Serratia and Pantoea. Microbiology 156, 240-247, doi:10.1099/mic.0.032797-0 (2010).

10 Chen, I. & Dubnau, D. DNA uptake during bacterial transformation. Nat Rev Microbiol 2, 241-249, doi:10.1038/nrmicro844 (2004).

11 Darling, A. E., Miklos, I. & Ragan, M. A. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4, e1000128, doi:10.1371/journal.pgen.1000128 (2008).

12 Zuckerkandl, E. & Pauling, L. Molecules as documents of evolutionary history. J Theor Biol 8, 357-366 (1965).

13 Hogeweg, P. & Hesper, B. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 20, 175-186 (1984).

14 Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680 (1994).

15 Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797, doi:10.1093/nar/gkh340 (2004).

16 Lassmann, T. & Sonnhammer, E. L. Kalign, Kalignvu and Mumsa: web servers for multiple sequence alignment. Nucleic Acids Res 34, W596-599, doi:10.1093/nar/gkl191 (2006).

17 Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217, doi:10.1006/jmbi.2000.4042 (2000).

18 Morgenstern, B., Dress, A. & Werner, T. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci U S A 93, 12098-12103 (1996).

19 Blackshields, G., Wallace, I. M., Larkin, M. & Higgins, D. G. Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 6, 321-339 (2006).

20 Swofford, D. L. & Berlocher, S. H. Inferring evolutionary trees from gene frequency data under the principle of maximum parsimony. Systematic Biology 36, 293-325 (1987).

21 Kolaczkowski, B. & Thornton, J. W. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431, 980-984, doi:10.1038/nature02917 (2004).

22 Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368-376 (1981).

23 Huelsenbeck, J. P., Ronquist, F., Nielsen, R. & Bollback, J. P. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310-2314, doi:10.1126/science.1065889 (2001).

24 Opperdoes, F. Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). Publication online. URL: http://www.icp.ucl.ac.be/opperd/private/upgma.html (1997).

25 Fitch, W. M. & Margoliash, E. Construction of phylogenetic trees. Science 155, 279-284 (1967).

26 Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406-425 (1987).

27 Felsenstein, J. Inferring phylogenies. Vol. 2 (Sinauer Associates, Sunderland, 2004).

28 Blount, Z. D., Borland, C. Z. & Lenski, R. E. Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli. Proc Natl Acad Sci U S A 105, 7899-7906, doi:10.1073/pnas.0803151105 (2008).

29 Hug, L. A. et al. A new view of the tree of life. Nat Microbiol 1, 16048, doi:10.1038/nmicrobiol.2016.48 (2016).

30 Chan, C. X., Darling, A. E., Beiko, R. G. & Ragan, M. A. Are protein domains modules of lateral genetic transfer? PLoS One 4, e4524, doi:10.1371/journal.pone.0004524 (2009).

31 Ragan, M. A., Bernard, G. & Chan, C. X. Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol 11, 176-185, doi:10.4161/rna.27505 (2014).

32 Darwin, C. & Bynum, W. F. The origin of species by means of natural selection: or, the preservation of favored races in the struggle for life. (2009).

33 Bapteste, E. et al. Prokaryotic evolution and the tree of life are two different things. Biol Direct 4, 34, doi:10.1186/1745-6150-4-34 (2009).

34 Corel, E., Lopez, P., Meheust, R. & Bapteste, E. Network-Thinking: Graphs to Analyze Microbial Complexity and Evolution. Trends Microbiol 24, 224-237, doi:10.1016/j.tim.2015.12.003 (2016).

35 Doolittle, W. F. Phylogenetic classification and the universal tree. Science 284, 2124-2129 (1999).

36 Ragan, M. A. Trees and networks before and after Darwin. Biol Direct 4, 43; discussion 43, doi:10.1186/1745-6150-4-43 (2009).

37 Hein, J. A heuristic method to reconstruct the history of sequences subject to recombination. Journal of Molecular Evolution 36, 396-405 (1993).

38 Gusfield, D., Eddhu, S. & Langley, C. Efficient reconstruction of phylogenetic networks with constrained recombination. Proc IEEE Comput Soc Bioinform Conf 2, 363-374 (2003).

39 Gusfield, D. Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained and structured recombination. Journal of Computer and System Sciences 70, 381-398 (2005).

40 Cong, Y., Chan, Y. B., Phillips, C. A., Langston, M. A. & Ragan, M. A. Robust Inference of Genetic Exchange Communities from Microbial Genomes Using TF-IDF. Front Microbiol 8, 21, doi:10.3389/fmicb.2017.00021 (2017).

41 Jachiet, P. A., Colson, P., Lopez, P. & Bapteste, E. Extensive gene remodeling in the viral world: new evidence for nongradual evolution in the mobilome network. Genome Biol Evol 6, 2195-2205, doi:10.1093/gbe/evu168 (2014).

42 Atkinson, H. J., Morris, J. H., Ferrin, T. E. & Babbitt, P. C. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 4, e4345, doi:10.1371/journal.pone.0004345 (2009).

43 Dagan, T., Artzy-Randrup, Y. & Martin, W. Modular networks and cumulative impact of lateral transfer in genome evolution. Proc Natl Acad Sci U S A 105, 10039-10044, doi:10.1073/pnas.0800679105 (2008).

44 Halary, S., Leigh, J. W., Cheaib, B., Lopez, P. & Bapteste, E. Network analyses structure genetic diversity in independent genetic worlds. Proc Natl Acad Sci U S A 107, 127-132, doi:10.1073/pnas.0908978107 (2010).

45 Bapteste, E. The origins of microbial adaptations: how introgressive descent, egalitarian evolutionary transitions and expanded kin selection shape the network of life. Front Microbiol 5, 83, doi:10.3389/fmicb.2014.00083 (2014).

46 Yutin, N., Raoult, D. & Koonin, E. V. Virophages, polintons, and transpovirons: a complex evolutionary network of diverse selfish genetic elements with different reproduction strategies. Virol J 10, 158, doi:10.1186/1743-422X-10-158 (2013).

47 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

48 Vinga, S. & Almeida, J. Alignment-free sequence comparison-a review. Bioinformatics 19, 513-523 (2003).

49 Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467-1490, doi:10.1089/cmb.2010.0056 (2010).

50 Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106, 2677-2682, doi:10.1073/pnas.0813249106 (2009).

51 Sims, G. E. & Kim, S. H. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci U S A 108, 8329-8334, doi:10.1073/pnas.1105168108 (2011).

52 Jun, S. R., Sims, G. E., Wu, G. A. & Kim, S. H. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proc Natl Acad Sci U S A 107, 133-138, doi:10.1073/pnas.0913033107 (2010).

53 Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, doi:10.1186/1471-2148-9-195 (2009).

54 Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 41, e75, doi:10.1093/nar/gkt003 (2013).

55 Leimeister, C. A., Boden, M., Horwege, S., Lindner, S. & Morgenstern, B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30, 1991-1999, doi:10.1093/bioinformatics/btu177 (2014).

56 Horwege, S. et al. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42, W7-11, doi:10.1093/nar/gku398 (2014).

57 Russell, D. J., Way, S. F., Benson, A. K. & Sayood, K. A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences. BMC Bioinformatics 11, 601, doi:10.1186/1471-2105-11-601 (2010).

58 Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336-350, doi:10.1089/cmb.2006.13.336 (2006).

59 Gusfield, D. Algorithms on strings, trees and sequences: computer science and computational biology. (Cambridge university press, 1997).

60 Haubold, B., Pierstorff, N., Moller, F. & Wiehe, T. Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics 6, 123, doi:10.1186/1471-2105-6-123 (2005).

61 Leimeister, C. A. & Morgenstern, B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30, 2000-2008, doi:10.1093/bioinformatics/btu331 (2014).

62 Notredame, C. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3, e123, doi:10.1371/journal.pcbi.0030123 (2007).

63 Mitrophanov, A. Y. & Borodovsky, M. Statistical significance in biological sequence analysis. Brief Bioinform 7, 2-24 (2006).

64 Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol Direct 8, 3, doi:10.1186/1745-6150-8-3 (2013).

CHAPTER 2: ALIGNMENT-FREE INFERENCE OF HIERARCHICAL AND RETICULATE PHYLOGENOMIC RELATIONSHIPS

This chapter is a literature review describing the concept of alignment-free approaches applied to the inference of phylogenetic relationships, presented in the form of a manuscript. This work was published in the journal Briefings in Bioinformatics (doi:10.1093/bib/bbx067) and has been edited for this thesis. As the first author of this paper, I helped write and edit the manuscript and created the figures.

2.1 Alignment-free inference of hierarchical and reticulate phylogenomic relationships

2.1.1 Abstract

We are amidst not only an ongoing flood of sequence data arising from the application of next-generation technologies, but also a fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error-prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by treelike descent with modification, but also by incorporating stretches of exogenous DNA. Thus next-generation phylogenomics must address computational scalability while re-thinking the nature of orthogroups, the alignment of multiple sequences, and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free approaches. Here we review the conceptual foundations of alignment-free phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, reflecting on what seems to be successful and where further development is needed.

2.1.2 Introduction

Phylogenomics refers to an important body of theory, methodology and tools applicable to the comparative analysis of genome-scale data within an evolutionary context1-4. The field builds on molecular phylogenetics, which since the early 1960s has been developed to elucidate genealogical relationships and evolutionary processes within families of genes or proteins. As one of the first areas of molecular bioscience to develop an explicitly algorithmic approach, and drawing richly on statistics and computational science, phylogenetics is considered a major area within bioinformatics.

By definition, the inference of genealogical relationships must be based on homologous characters. Even before molecules could be fully sequenced, it was known that certain oligopeptides were common to representatives (in different biological species) of individual proteins, e.g. α-haemoglobin or insulin; similarly, 16S ribosomal RNAs in different species shared sets of short oligonucleotides5. Indeed, the presence of shared identity in sequences beyond the extent required to deliver conserved function was taken as evidence for homology6. As full-length sequences became available, it made sense to identify and display these conserved regions in a multiple sequence alignment (MSA)7,8. Molecular phylogenetics is thus endowed with a richness and precision rarely seen with phenetic characters: homology is no longer “overall” and subjective, but can be evidenced column-by-column along a set of aligned sequences. Thus an MSA is an explicit position-by-position hypothesis of homology. Not coincidentally, an MSA matrix serves as a convenient input to software programs that calculate pairwise dissimilarities (for distance methods) or compute a tree that best explains the patterns in the aligned columns, given a model of sequence change over time (in e.g. parsimony or likelihood methods).

With the advent and spread of genomics, very large datasets are becoming available for phylogenetic inference, with sequences longer (genomes rather than single genes or proteins) and much more numerous. It is no longer unusual to encounter datasets with thousands of genome or (concatenated) exome sequences. Phylogenomics at this scale requires organised data management, significant computational power and large memory. However, phylogenomics can present challenges other than those arising purely from size and scale: the data may be of low quality, assembly may be poor or nonexistent, the sequences may not be co-linear over their entire length, different models of sequence change probably apply (e.g. to protein-coding and non-coding regions), and sequence regions may have different origins and evolutionary histories9. We consider these factors in turn.

2.1.3 Critique of classical (alignment-based) phylogenetics

In the early days of genomics, much effort went into “finishing and polishing” – joining contigs and resolving conflicts to recover full-length chromosomes nearly free of ambiguities or errors. Thanks to this body of earlier work, key information (e.g. presence or absence of a gene or pathway) can now often be obtained simply by deep sequencing followed by a rough assembly. Depending on one’s scientific goals, the breadth-versus-depth tradeoff can be pushed dramatically toward breadth, at the expense of data quality. Survey projects now target tens of thousands of bacterial genomes, few of which will assemble into a single contig, while large eukaryote genomes can be approached through transcriptomics, with consequences for MSA including the need to deal with truncated sequences and alternative splice forms. Indeed, there is optimism that phylogenetic trees might be inferred entirely without assembly10.

Basic MSA requires sequences to be collinear, i.e. to preserve a common ancestral order of elements. Depending on the sequences being compared, these elements may be (for example) nucleotides, codons, amino acids, domains, exons or genes. Non-collinearity may arise due to poor sequence quality or misassembly, but could also be real, particularly at whole-genome scale. Across bacterial genomes, gene order tends to be poorly conserved except among close relatives; exceptions include ribosomal RNA operons, and genes encoding some ribosomal proteins. Thus even with correctly assembled genomes, a separate bioinformatic step is required to match putatively homologous regions prior to, or as part of, MSA. Like MSA software, whole-genome aligners take different algorithmic approaches and implement different assumptions and tradeoffs11,12 but in general are CPU- and memory-intensive, with considerable scope for error and ambiguity arising e.g. from families of repetitive elements, low-complexity regions and paralogs.

Most approaches to phylogenetic inference require a statistical model of sequence evolution13. It is not difficult to imagine that different classes of sequence (e.g. those encoding a protein, or a functional RNA, or no product at all) are best described by different models. Even within a single gene, inference quality may be improved by applying different rate classes or steady-state assumptions, e.g. for DNA regions that encode highly structured versus unstructured regions of proteins, or stems versus loops of ribosomal RNAs. It is computationally onerous to identify, delineate and group these regions, match each to the best model and optimise parameter values. Scalability to genome-scale data would be facilitated by simplifying these models, using a single generic model or eliminating them altogether.

In the standard Darwinian model, genomes are inherited vertically from one generation to the next within lineages. To a first approximation, this vertical inheritance adequately describes the evolution of nuclear genomes, between species, of morphologically complex organisms including animals and plants. However, genomes of bacteria, archaea, protists, viruses and plasmids often contain, in addition, stretches of DNA acquired laterally from unrelated organisms, or from the environment. Many studies indicate that 10-40% of genes in some bacterial genomes, and essentially all gene families in bacteria, have been affected by lateral (horizontal) genetic transfer (LGT or HGT)14,15 (and references therein). For such genomes, a phylogenomic workflow must distinguish vertical from lateral signal, and treat each separately. To complicate matters, neither genes nor domains (genomic regions corresponding to protein domains) are privileged units of LGT16,17; new lateral events can overwrite older ones; regions of lateral origin may ameliorate, i.e. evolve to become indistinguishable from their new host genome18; and older lateral regions will themselves be inherited vertically within subtrees15,19. In MSA-based phylogenomics these issues are addressed by adding further (computationally difficult) steps to the workflow, e.g. inferring an organismal reference tree and comparing features of its topology with those of individual gene- or protein-family trees14,20. Opportunities abound for complications to arise from cryptic paralogy, or inappropriate delineation of the units of analysis.

Based on the above, we might summarise our wish list for next-generation phylogenomics9: it must be based on homologous signal (a different subset of signal for each evolutionary origin), while avoiding the assumptions inherent in MSA (predefined fixed units of analysis; co-linearity). It should incorporate a generic, computationally simple substitution model; be highly scalable to large data, yet robust to low data quality; and perhaps support phylogenetic inference from unassembled sequence reads. Alignment-free (AF) methods offer considerable promise against each of these goals9.

2.1.4 Alignment-free methods and k-mers

AF methods underpin key algorithms in diverse areas of bioinformatics including database searching21,22, sequence clustering23, error correction in sequencing reads24, genome assembly25, discriminative prediction of regulatory variants26,27 and testing for genetic recombination28. In phylogenetics and phylogenomics, AF methods offer alternatives to the assumptions and computational demands of MSA identified above. Following Haubold29, AF methods can be classified broadly into those based on word (k-mer) count, and those based on match length. Certain other AF methods may fit uncomfortably into these classes, or lie outside them altogether30. In the present context, the motivating concept is the same: substrings (perhaps defined by k-mers) that meet certain criteria, and are shared by a set of sequences, can be considered as capturing part of the homology signal and are thus potentially informative on phylogeny. Here we focus primarily on k-mer count methods.

Substrings (sub-sequences) of defined length are variously known as words, k-mers or n-grams, with k or n denoting the substring length. By disallowing mismatch, degeneracy and indels, k-mer statistics become simpler and the computation more efficient. These strictures may be slightly relaxed (e.g. by allowing limited mismatch to deal with noise) or avoided in part (by recoding into a reduced alphabet), albeit at the risk of crossing into the realm of pattern or motif analysis, for which very different and computationally less-favourable methods are required.

Any molecular sequence can be represented as the set of its constituent k-mers (Figure 1). These k-mers are typically allowed to overlap with stride = 1; a larger stride ≤ k could be used to reduce the computational effort. Whereas in MSA the linear order of sequence elements is fundamental to recognising conserved (homologous) positions and identifying conservation profiles, the analogous concept in AF is an order-less matching of k-mers, i.e. an intersection of k-mer sets. For sufficiently large k, any given k-mer is approximately unique to a sequence31, so in the absence of extenuating circumstances (e.g. strong mutational bias, low-complexity regions) shared instances of that k-mer are likely to be homologous. As sequences progressively diverge on a tree, they share fewer k-mers in common, and the longest k-mer they share tends to be shorter32. As we discuss below, these measures can be used to estimate a pairwise distance. For this, the count or frequency of shared k-mers seems to be sufficient, i.e. it is not necessary to keep track of positional information33 unless we wish to map specific k-mers (e.g. those inferred to have a lateral origin) to genes, structures or functions34-36.
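As a minimal sketch of this representation, the Python functions below decompose a sequence into k-mers with a configurable stride and compare two sequences by the order-less intersection of their k-mer sets. The function names are hypothetical, and the Jaccard index is used here only as one simple, assumed way of turning the intersection into a similarity score; it is not a measure proposed in the text.

```python
def kmers(seq, k, stride=1):
    """List the k-mers of seq, starting a new word every `stride` positions."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def jaccard(a, b, k):
    """Shared k-mers over all distinct k-mers of the two sequences."""
    sa, sb = set(kmers(a, k)), set(kmers(b, k))
    return len(sa & sb) / len(sa | sb)


seq1 = "ACGTTTCAGAAACTAGCTAGGT"
seq2 = "ACGTTTCATGGAGTTTAATCGA"
shared = set(kmers(seq1, 7)) & set(kmers(seq2, 7))
print(sorted(shared))  # ['ACGTTTC', 'CGTTTCA']
```

With stride = 1 each 22-nucleotide sequence yields 16 overlapping 7-mers, whereas stride = 7 yields only three non-overlapping words, which reduces the work at the cost of missing matches that are out of phase.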

[Figure 1 image: two DNA sequences, Seq1 (ACGTTTCAGAAACTAGCTAGGT) and Seq2 (ACGTTTCATGGAGTTTAATCGA), each decomposed into overlapping 7-mers. Panels: (A) exact matches; (B) inexact matches (mismatch = 1); (C) filtered feature sets (R = A or G, Y = C or T); (D) spaced words (pattern 1100010, length = 7).]

Figure 1. Fundamental concepts and nomenclature of k-mers, illustrated here for overlapping k-mers (k = 7, stride = 1) in two DNA sequences. (A) exact matches, (B) inexact matches, (C) degenerate bases, and (D) a binary pattern of match and non-match positions (spaced word matches).
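The four matching modes named in the caption can be illustrated with a short Python sketch. The helper names are hypothetical, and the example words are taken from the third 7-mer of each sequence in the figure, which differ only at the last position.

```python
def hamming(a, b):
    """Number of mismatching positions between two words of equal length."""
    return sum(x != y for x, y in zip(a, b))


# Reduced alphabet: purines (A, G) become R, pyrimidines (C, T) become Y.
RY = str.maketrans("AGCT", "RRYY")


def recode_ry(word):
    """Recode a DNA word into the two-letter purine/pyrimidine alphabet."""
    return word.translate(RY)


def spaced_word(word, pattern):
    """Keep only the characters at positions where the pattern has a '1'."""
    return "".join(c for c, p in zip(word, pattern) if p == "1")


w1, w2 = "GTTTCAG", "GTTTCAT"
print(w1 == w2)                                                  # (A) exact match: False
print(hamming(w1, w2) <= 1)                                      # (B) one mismatch allowed: True
print(recode_ry(w1) == recode_ry(w2))                            # (C) RY recoding: False
print(spaced_word(w1, "1100010") == spaced_word(w2, "1100010"))  # (D) spaced word: True
```

The pair fails the exact and recoded comparisons because G is a purine and T a pyrimidine, but it matches when one mismatch is tolerated or when the differing position is a don't-care position of the pattern.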

By contrast, conservation profiles measure local sequence similarity and thus require at least approximate positional information. Classical alignment algorithms such as Smith-Waterman or Needleman-Wunsch employ dynamic programming to determine columns and/or blocks of matching residues in a set of sequences. From the number or proportion of conserved residues within a column, a conservation profile can be derived. AF conservation profiles can be constructed by plotting the maximum number of matching k-mers over their mean positions within the set of sequences37. While AF conservation profiles do not achieve single-residue resolution, they are faster to compute than classical profiles (linear versus quadratic time with respect to sequence length) yet still allow the identification of conserved regions or domains, even if non-collinear37.

Extraction of k-mer sets from molecular sequences is in principle trivial, accomplished simply by sliding a window of size k over a string representation of the sequence to produce a lexicon of overlapping k-mers (Figure 1). In application to DNA, it is usual to record only canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement), or to work with only the forward strand. Efficient accumulation of k-mer counts requires that we determine the novelty of each k-mer as it appears: previously encountered k-mers must be identified rapidly and their counts incremented, while novel k-mers must be inserted quickly and without impairing the performance of the data store during subsequent queries. Approaches to this task may be categorized broadly into those based on hashing, and those relying more directly on data structures invented for string lookup and spelling correction, notably the suffix tree38 and the suffix array39.
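The sliding-window extraction and canonical-form convention described above can be sketched as follows. This is a minimal illustration using an ordinary in-memory dictionary, not the optimised data stores discussed below; the function names are hypothetical, and skipping windows that contain characters other than A, C, G or T is an assumption of the sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(seq, k):
    """Slide a window of size k along seq and count canonical k-mers."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous characters (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


counts = count_canonical_kmers("ACGT", 2)
# AC and GT are reverse complements of each other, so both count towards "AC".
```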

Exact hashing methods offer constant-time insertion and lookup, and their use in bioinformatics has a long history. Naïve hashing, however, proves surprisingly slow40 and requires memory linear in the number of distinct k-mers, which in turn is exponential in the size of the k-mers: $O(|\Sigma|^{k})$, where $|\Sigma|$ is the size of the alphabet. Jellyfish (Marçais & Kingsford 2011) overcomes many of these problems through careful design of a lock-free multithreaded hash table, using a key encoding and bit packing to ensure far lower memory usage. Tessel, part of the Blue read-correction package41, markedly reduces memory requirements for sequence reads by excluding singleton k-mers until a second occurrence is observed. Even for modest coverage, genuine k-mers will occur many times, with singletons almost certainly the result of sequencing error. Confirmed k-mers are recorded in optimized partitioned lock-free hash tables, with a subsequent merge phase to ensure accurate final counts. Melsted and Pritchard42 address memory usage via a Bloom filter43, a probabilistic distributed hashing scheme which admits a small chance of a false positive match. A standard hash table is used to store entries seen twice or more, but the implementation remains single-threaded and is not competitive. Hybrids of this nature have been used recently in the context of de Bruijn graphs44.
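The idea of deferring singleton k-mers with a Bloom filter can be sketched as below. This is a toy illustration of the principle only, not the implementation of Melsted and Pritchard or of Tessel: the class `BloomFilter`, its default sizes and the function `count_repeated_kmers` are all assumptions made for this example. A k-mer enters the exact hash table only on its second sighting, so a rare false positive in the filter would credit a true singleton with a count of 2; a production tool would correct this in a later pass.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Toy Bloom filter; the bit-array size and hash count are arbitrary choices."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def count_repeated_kmers(seq, k, bloom=None):
    """Count k-mers seen at least twice; first sightings live only in the filter."""
    if bloom is None:
        bloom = BloomFilter()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
        elif kmer in bloom:      # second sighting (or a rare false positive)
            counts[kmer] = 2
        else:
            bloom.add(kmer)      # first sighting: remember it cheaply
    return counts


repeated = count_repeated_kmers("ACGACGACGT", 3)
```

In this example the singleton CGT never reaches the hash table, which is where the memory saving arises when most singletons are sequencing errors.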

Suffix trees represent a string through its underlying suffixes, each encoded as a path from the root to a leaf, with the start position of the suffix within the string stored in this terminating node. Suffix arrays contain these same start positions, but arranged according to the lexicographical order of the suffixes included. For both, construction time and space are linear in sequence length, hence in the number of k-mers, but the array requires far fewer bits per suffix, perhaps half the footprint of the tree39. Lookup is linear in the length of the query, here $O(k)$, for suffix trees and enhanced suffix arrays45. Suffix trees have long been applied in k-mer matching (e.g. in Mummer 1.046), while k-mer counters based on suffix arrays have included Meryl (part of the Celera Assembler)47 and Tallymer48, the latter enhanced by storing the longest common prefix among suffix groups. While the optimized hashing methods discussed above appear superior for general k-mer counting and retrieval, suffix data structures are a far better choice for error correction and approximate matching. Even so, these tasks may be prohibitively expensive for very large datasets.
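To make the suffix-array idea concrete, the following sketch builds an array by naively sorting suffixes and counts occurrences of a k-mer by binary search. The construction here is deliberately simple and is not linear-time, and the lookup costs O(k log n) rather than the O(k) attainable with enhanced suffix arrays; both function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographical order of the suffixes.

    Naive construction by sorting; adequate for a sketch, not linear-time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, query):
    """Count occurrences of query using two binary searches over the array."""
    m = len(query)

    def bound(strict):
        # First index whose m-character prefix is >= query (or > query if strict).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


seq1 = "ACGTTTCAGAAACTAGCTAGGT"
sa = build_suffix_array(seq1)
n_ctag = count_occurrences(seq1, sa, "CTAG")   # CTAG occurs twice in Seq1
```

Because all suffixes sharing a prefix are adjacent in the array, every occurrence of a k-mer falls within one contiguous block, which is what the two bounds delimit.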

2.1.5 Phylogenetic inference based on k-mers

As we have mentioned, as sequences diverge over time from a common ancestor they will come to share fewer, and shorter, k-mers. More precisely: given a threshold τ such that k-mers of length k ≥ τ occurring in related sequences can be considered homologous, as these sequences diverge (a) for a fixed k ≥ τ the number of shared k-mers will tend to decrease, (b) over all k ≥ τ the mean length of shared k-mers will tend to decrease, and (c) the longest shared k-mer will tend to be shorter. The optimal τ is likely to be problem- and data-dependent (see below), but could be selected based on the distribution of match lengths in simulated sequences49, e.g. to maximise the area under the ROC curve or ensure a minimum desired frequency of true positives. Measures that capture these trends behave as pairwise similarities, and like their classical MSA-based counterparts can be used in distance analysis to generate a tree50-58.

The best-known measure of k-mer distance is based on the D2 statistic21,51,59-62. Building on a proposal by Blaisdell63, D2 is simply the count of exact k-mer matches between two sequences, summed over all k-mers at a given k. As this count depends on the sequence lengths, D2 is often normalised by the probability of k-mer occurrence, or by assuming a Poisson distribution51,64. Chan et al.65 introduced a neighbourhood variant. Even so, for D2-based measures to be applied confidently, particularly in the comparison of closely related sequences, understanding the k-mer structure of actual genomes would be highly desirable60,66-69.
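As a minimal sketch of the unnormalised statistic, D2 can be computed as the sum, over all k-mers w, of the product of the number of occurrences of w in each sequence. The function names below are illustrative only, and none of the normalisations mentioned above is applied.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (stride = 1)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Unnormalised D2: number of exact k-mer matches between two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


# The two sequences of Figure 1 share two 7-mers, each occurring once per sequence.
score = d2("ACGTTTCAGAAACTAGCTAGGT", "ACGTTTCATGGAGTTTAATCGA", 7)
```

Because every occurrence in one sequence is paired with every occurrence in the other, the raw score grows with sequence length, which is why some normalisation is usually needed before scores are compared across pairs.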

Workflows of distance matrix methods leading to AF trees differ little from their classical counterparts, except that MSA is not required (Figure 2). Putatively homologous sequences (e.g. genomes) are assembled and quality-checked, e.g. for illegal characters. K is selected (see below), k-mers are extracted, and distances are computed pairwise (above) and assembled into a triangular matrix that is input into software that implements neighbour-joining70,71 or another distance-based algorithm. Because distance algorithms build trees by clustering sequences rather than by estimating a measure of changes along internal edges, some authorities consider them non-phylogenetic. Here we follow Felsenstein72 (pages 145-146) in relegating this distinction to debates over classification, and for the purpose at hand accept distance as a legitimate basis for the statistical inference of phylogeny. Höhl & Ragan73 pointed out that shared k-mers could be arranged into a (very local) “alignment” matrix and used as input into likelihood, Bayesian or other (non-distance) algorithms for tree inference, although at the cost of the speed and scalability we hoped to secure by taking an AF approach in the first place.

[Figure 2 (image not reproduced). Panel A, extraction of k-mers: 7-mers are extracted from Seq1 (ACGTTTCAGAA...), Seq2 (ACGTTTCATGG...), Seq3 (ACGATTCAACC...) and Seq4 (ATGATTCATGG...). Panel B, pairwise comparison of the 7-mer sets. Panel C, pairwise distance matrix:

      Seq1  Seq2  Seq3  Seq4
Seq1  0.0
Seq2  0.3   0.0
Seq3  0.5   0.4   0.0
Seq4  0.7   0.6   0.4   0.0

Panel D, phylogenetic tree of Seq1 through Seq4.]

Figure 2. An AF phylogenetic workflow in which (A) k-mers (k = 7, stride = 1) are extracted from four sequences (Seq1 through Seq4), (B) shared 7-mers are identified by pairwise comparisons, (C) a pairwise distance matrix is calculated, from which (D) a tree is computed using a distance-based method e.g. neighbour-joining.
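Steps A to C of this workflow can be sketched in a few lines. For simplicity the sketch uses the Jaccard distance between k-mer sets, which is an assumption made here for illustration and is not one of the D2-based measures discussed above; the sequence fragments are the prefixes shown in Figure 2, and all function names are hypothetical. The resulting matrix would then be passed to neighbour-joining software for step D.

```python
from itertools import combinations


def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; a simple stand-in for a k-mer distance."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def distance_matrix(seqs, k):
    """Return sequence names and a symmetric dict-of-dicts distance matrix."""
    names = list(seqs)
    sets = {name: kmer_set(seqs[name], k) for name in names}
    dist = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        dist[x][y] = dist[y][x] = jaccard_distance(sets[x], sets[y])
    return names, dist


def lower_triangle(names, dist):
    """Format the matrix as a lower triangle, in the layout of Figure 2C."""
    rows = []
    for i, name in enumerate(names):
        values = " ".join(f"{dist[name][other]:.3f}" for other in names[:i + 1])
        rows.append(f"{name} {values}")
    return "\n".join(rows)


fragments = {
    "Seq1": "ACGTTTCAGAA",
    "Seq2": "ACGTTTCATGG",
    "Seq3": "ACGATTCAACC",
    "Seq4": "ATGATTCATGG",
}
names, dist = distance_matrix(fragments, 7)
table = lower_triangle(names, dist)
```

On fragments this short most pairs share no 7-mers at all and so sit at the maximum distance of 1; meaningful distances require sequences long enough to carry many k-mers.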


K is the critical parameter in AF phylogenomics. As we depend on k-mers to capture homology signal, the value we select for k must be large enough to ensure that few k-mers are present in our analysis purely by chance, but not so large that informative k-mers are arbitrarily excluded and signal unnecessarily attenuated. The factors most important for selecting an optimal k are the alphabet (e.g. nucleotide versus amino acid) and the complexity, divergence and length of the sequences under investigation. Given the complexities of sequence evolution, the performance of AF methods is best assessed by computational simulation rather than analytically. Using evolver in PAML74 we simulated the evolution of sequences on a tree under a general time-reversible model to examine how G+C content, length of terminal or more-basal branches, rearrangement, truncation and among-site rate heterogeneity affect precision and recall. We also simulated indels, and explored trees generated under non-ultrametric and coalescent models65. Our results indicate that for k within an optimal range, many AF methods can perform well under basic scenarios5,65,75, indeed better than MSA in the presence of rearrangement or indels65.
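The lower bound on k (few k-mers present purely by chance) can be given a rough back-of-envelope form. Assuming, unrealistically, independent and uniformly distributed residues, the expected number of chance k-mer matches between sequences of lengths m and n is approximately (m - k + 1)(n - k + 1) / |Σ|^k. The sketch below finds the smallest k that keeps this expectation under a chosen threshold; the threshold of 0.01 and the function names are assumptions, and the calculation says nothing about the upper bound on k, which depends on divergence and is better explored by simulation as described above.

```python
def expected_chance_matches(len_a, len_b, k, alphabet_size=4):
    """Expected chance k-mer matches under an i.i.d. uniform model (an assumption)."""
    return (len_a - k + 1) * (len_b - k + 1) / alphabet_size ** k


def smallest_k(len_a, len_b, alphabet_size=4, max_expected=0.01, k_max=64):
    """Smallest k whose expected number of chance matches is below max_expected."""
    for k in range(1, min(len_a, len_b, k_max) + 1):
        if expected_chance_matches(len_a, len_b, k, alphabet_size) < max_expected:
            return k
    return None


k_dna = smallest_k(1500, 1500, alphabet_size=4)        # gene-length nucleotide sequences
k_protein = smallest_k(1500, 1500, alphabet_size=20)   # same length, amino-acid alphabet
```

The larger amino-acid alphabet reaches the same expectation at a much smaller k, consistent with the smaller optimal k reported for proteins than for nucleotides.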

In the case of empirical data the true tree is unknown, making it impossible to assess performance using measures of precision and recall. To compare AF methods under scenarios of sequence divergence, rearrangement, inversion and LGT we focused instead on sensitivity to change of parameter value (e.g. k), and accuracy in the sense of recovering accepted subtrees75. All nine AF methods we examined were robust against complex genome rearrangements or inversions, and most word-count methods were robust and computationally efficient against moderate levels of LGT. Performance varied with the level of divergence, with the word-count methods more accurate than match-length methods at higher divergence. The optimal size of k was sensitive to the extent of sequence divergence, but was little affected by the other scenarios we simulated. Thus for datasets of known divergence, AF methods might be applied without exploratory tuning of k, and could be expected to perform as well as or better than MSA-based approaches. However, AF methods have not been rigorously examined under fully realistic scenarios in which different lineages may evolve at different or variable rates, under different models of substitution, and/or with biases that give rise to compositional convergence.

For large datasets of bacterial and archaeal genomes, we inferred biologically realistic AF trees in which many clades familiar from MSA-based studies were recovered. Our code generally ran quickly, yielding accurate trees for thousands of bacterial genomes in some tens of hours on a moderate-sized cluster76. Most differences between the AF and MSA trees involved terminal branches, i.e. the most-closely related genomes. We investigated a multiple-k approach in hopes that longer k might provide better resolution at the termini, while shorter k would be more appropriate for the eroded signal at more-basal bipartitions. In our hands this was unsuccessful, but an adaptive or multi-k approach might bear more-systematic reinvestigation. Some AF methods can also be used directly on large NGS data, i.e. sets of reads or contigs with only basic assembly, or none at all10,57. Memory is the main limitation for k-mer-based approaches, but the actual demand depends on the implementation used, and can sometimes be traded off against speed of computation. AF methods with optimal memory consumption are slower than the more memory-greedy methods, with current implementations limited to k = 32 in most cases50,53.

AF methods nonetheless retain certain limitations. In simulations, the D2-based methods we examined recover the reference topology when applied to sequences of length 10,000 nt (e.g. genomes, operons), but are prone to errors at 1500 nt (genes) and fail at 250 nt (domons)65. By disregarding singleton k-mers (i.e. erroneous reads) it is possible to improve distance estimates at higher coverage, but this degrades the signal at lower coverage10. In the MSA context, distance methods are criticized for reducing the pairwise comparison between sequences to a single number, in the process losing information on patterns of conservation within and among sequences; this is true of k-mer distances as well. Alternative approaches might involve a k-mer substitution model, but this scarcely seems feasible if the substitution matrix would be high-dimensional, sparse and dependent on immense data for parameterisation. Indeed, we suspect that such an approach would be so computationally expensive that it would negate the advantages of taking an AF approach in the first place. Methods exist for computation with sparse matrices, but to our knowledge have not been explored in a phylogenetic or phylogenomic context.

2.1.6 Alignment-free approaches to lateral genetic transfer

For phylogenomic analysis of genomes potentially affected by LGT, we must also identify and deconvolute vertical and lateral signal. In MSA-based phylogenomics this is done by appending a filter to the standard workflow: trees inferred for individual gene families are compared with a reference topology, and well-supported but conflicting bipartitions are taken as prima facie evidence of LGT77,78. The corresponding gene or protein family might then be excluded (to purify the vertical signal), or analysed separately to understand the sources, recipients, processes and impact of LGT.

At first glance, there is much to recommend a similar workflow for AF phylogenomics. An approach built on k-mers might liberate us from having to take genes, or any other pre-defined features, as the units of analysis. In MSA-based phylogenomics, incongruent signal can be traced back to the underlying gene-family MSA, but this does not tell us which of the aligned genes is/are responsible for the incongruence, the number and quality of alternative signals, or the number, quality or location of recombination breakpoints16,79. AF methods might give us fine-scale access to some or all of this information. Indeed, with AF we have further options. A genomic region might have arisen by LGT if, in the absence of extenuating circumstances, it (a) unexpectedly shares k-mers with a distantly related genome, hence (b) exhibits an anomalously short D2 distance to that genome, such that (c) a distance tree computed for that region is topologically incongruent with that computed for a vertically inherited region, or with a trusted reference tree. Interestingly, these lines of evidence exactly parallel the three main strategies for LGT detection80-82.

To begin to explore these AF approaches, we simulated the evolution of DNA sequences on a tree using ALF83 or EvolSimulator84, then counted how many 21-mers are shared pairwise within a sliding window of length 60 nucleotides. In the first instance we did this in the absence of simulated LGT, so as to establish a baseline against which lateral regions could later be detected85 (Chua, Maetschke & Ragan, unpublished). This is a k-mer variant of approaches long used to find lateral regions within sequences based on anomalous G+C content, dinucleotide frequencies or codon usage18,81,82,86-92. We found that while the most-divergent sequences shared very few 21-mers (zero in most windows), a few windows shared as many as seven; although these are false positives, we could find no objective statistical criterion by which a hypothesis of LGT could be rejected for them. Conversely, with more-closely related sequence pairs most windows shared many 21-mers, but the variation was such that it would be impossible to recognise or bound a truly lateral region (Figure 3). Note also that a sliding-window approach could work only on assembled genomes or large scaffolds, not on masses of raw reads. The idea seemed promising, but something critical was missing.
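The sliding-window count can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the unpublished analysis: the function name, the default step of one position, and the choice to count only k-mers lying wholly inside the window (41 start positions for k = 21 and a 60-nt window) are all ours.

```python
def window_shared_kmer_counts(query, reference, k=21, window=60, step=1):
    """For each window along query, count its k-mers that also occur in reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    # hits[i] is 1 if the k-mer starting at position i of query occurs in reference.
    hits = [1 if query[i:i + k] in ref_kmers else 0
            for i in range(len(query) - k + 1)]
    per_window = window - k + 1  # k-mers lying wholly inside one window
    return [sum(hits[start:start + per_window])
            for start in range(0, len(query) - window + 1, step)]


# Toy example with small k and window so the result can be checked by hand.
profile = window_shared_kmer_counts("AAACCCGGGTTT", "CCCGGG", k=3, window=6)
# profile == [1, 2, 3, 4, 3, 2, 1]
```

The profile rises and falls as the window passes over the shared region; the difficulty noted above is deciding, from such a profile alone, which peaks are statistically meaningful.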

[Figure 3 (image not reproduced). Panel A: tree of the 26 simulated sequences S1 to S26. Panel B, titled "Comparison between S1 and S26": k-mer count per bin (y-axis 0 to 7) with the mean count across bins, for bins 0 to 30000. Panel C, titled "Comparison between S7 and S14": k-mer count per bin (y-axis 0 to 120) with the mean count across bins, for bins 0 to 35000.]

Figure 3. A sliding-window approach of k-mer sharing between sequences, illustrated here using a set of 26 sequences simulated on the tree depicted at the left. Pairwise comparisons are shown for (A) two highly dissimilar sequences (S1 and S26), and (B) two highly similar sequences (S1 and S2). Each plot shows the number of matching 21-mers within a 60-nt window as it is incremented along S1.

Thanks to cross-disciplinary collaboration, we soon discovered what was missing. Document analysis involves concepts that can be identical or analogous to those in molecular phylogenetics93-96 including the “contamination” of texts by lateral transfer97,98. A statistic known as term frequency–inverse document frequency (TF–IDF) is widely used to determine the importance of a word in a collection of documents: words which appear frequently in a document, but rarely in the rest of the corpus, carry greater importance for that document. A variant of TF–IDF might be used to detect lateral regions in molecular sequences. K-mers can be seen as analogues of words (albeit ones that sometimes overlap each other), groups of similar sequences as documents, and a sequence database as a corpus. Unlike in a classical MSA-based workflow, sequences must be arranged into groups, but subsequent steps are AF. Sequence regions that contain k-mers infrequent in their own group (TF) but frequent in another group (IDF) are inferred as instances of LGT from the donor group to the recipient sequence34. Our unsuccessful idea above (Figure 3) represented IDF without proper TF.

The resulting workflow (Figure 4) differs from AF workflows for purely vertical phylogenetics (e.g. Figure 2) in two main ways: the unit of analysis is not specified up-front, and sequences must be arranged into groups. Potential lateral segments are generated by merging k-mers that meet the IDF and TF requirements. A parameter G specifies the maximum allowable gap between k-mers to be merged into a lateral segment; where investigated, the number of LGT detections and total detection length were relatively insensitive to G. The resulting segments are typically of different lengths, and may map to intergenic regions, gene fragments, entire and/or multiple genes. By contrast, grouping the sequences in an effective manner proved to be non-trivial, yet critical to performance34,35,99. TF–IDF performs best when sequences are similar within-group but dissimilar between groups; so if our goal is to infer LGT, the best grouping will probably capture hierarchical descent. Even so, it may remain “difficult to disentangle the effects of group number, size, composition and phylogenetic cohesion”35.
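The two steps just described, flagging k-mers that are infrequent in the recipient's own group but frequent in another group and then merging nearby flagged k-mers into segments, can be sketched in simplified form. This is a presence/absence filter in the spirit of the workflow and not the published TF–IDF scoring: the thresholds `max_own` and `min_other`, the exclusion of the query from its own group, the measurement of the gap from the end of one k-mer to the start of the next, and all function names are assumptions made for illustration.

```python
def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_frequency(kmer, kmer_sets):
    """Fraction of the sequences in a group that contain kmer."""
    if not kmer_sets:
        return 0.0
    return sum(kmer in s for s in kmer_sets) / len(kmer_sets)


def lateral_kmer_positions(query, own_group, other_group, k,
                           max_own=0.5, min_other=0.5):
    """Start positions of k-mers rare in own_group but common in other_group."""
    own_sets = [kmer_set(s, k) for s in own_group]      # query itself excluded
    other_sets = [kmer_set(s, k) for s in other_group]
    positions = []
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if (group_frequency(kmer, own_sets) <= max_own
                and group_frequency(kmer, other_sets) >= min_other):
            positions.append(i)
    return positions


def merge_segments(positions, k, gap):
    """Merge flagged k-mers separated by at most `gap` into (start, end) segments."""
    segments = []
    for pos in sorted(positions):
        if segments and pos - segments[-1][1] <= gap:
            segments[-1][1] = max(segments[-1][1], pos + k)
        else:
            segments.append([pos, pos + k])
    return [tuple(s) for s in segments]


query = "AAAAGGGGCCTT"                      # recipient, member of group 1
group1_others = ["AAAATTTT", "AAAATTTT"]    # remaining members of group 1
group2 = ["GGGGCCCC", "GGGGCCCC"]           # candidate donor group
flagged = lateral_kmer_positions(query, group1_others, group2, k=4)
segments = merge_segments(flagged, k=4, gap=2)
# segments == [(4, 10)], i.e. query[4:10] == "GGGGCC"
```

Segment ends are half-open, so each tuple can be used directly to slice the recipient sequence.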

[Figure 4 (image not reproduced). Panel A, grouping of sequences and k-mers: Seq1 and Seq2 form Group 1; Seq3 and Seq4 form Group 2. Panel B: the 7-mer ACGTTTC, infrequent in Group 1 but frequent in Group 2, is marked as lateral. Panel C: nearby lateral k-mers are merged into a laterally transferred region, subject to the gap G. Panel D: a transfer from a donor group to a recipient sequence (left) is redrawn as a transfer from a donor group to a recipient group (right).]

Figure 4. Simplified workflow illustrating the use of TF–IDF to identify lateral genetic transfer. (A) Four sequences (Seq1 through Seq4) are grouped, here into two groups (X and Y) based on a reference tree. (B) All k-mers (k = 7, stride = 1) from each sequence are compared against the k-mers found in each of X and Y. A k-mer that is infrequent in the group to which the sequence belongs (TF), but frequent in another group (IDF), illustrated here by ACGTTTC in Seq1 that is infrequent in X but frequent in Y, is inferred to be of lateral origin. (C) Laterally transferred regions are constructed from sets of nearby lateral k-mers, where nearby means separated by ≤ gap G. For representation as a network, recipient sequences are subsumed into their respective groups with the result that transfers inferred from a donor group to a recipient sequence (D, left) are shown as from a donor group to a recipient group (D, right). For clique analysis, edge weight and directionality may further be ignored (see text).

With simulated data, it was obvious how to delineate groups and TF–IDF performed well, as measured by precision and recall, over a biologically realistic range of sequence lengths and evolutionary distances between and within groups. As expected, greater evolution post-LGT had a deleterious effect on performance, while deletions showed relatively little effect. With empirical data, groups (e.g. g-) known to engage in LGT were usually prominent in our TF–IDF analyses, while we inferred little or no LGT for groups known to be more quiescent35. Because only genes (not arbitrary regions) have Gene Ontology annotation, to study the functional implications of the inferred LGT we mapped the inferred lateral segments to genes, using data-dependent length and overlap thresholds35,99. For protein-coding genes, we might alternatively have asked whether the inferred lateral segments overlap regions that encode active sites or SCOP domains.

For groups related by a hierarchical tree, it may be possible to extract further information. If a genomic region is inferred to have received genetic material from two or more groups that are topologically adjacent on the tree, we might (depending on details) instead hypothesise that there had been a single transfer from a common ancestor of the donor groups. Imperfect overlap of the inferred lateral regions could be ascribed to the vagaries of subsequent evolution, and/or the IDF threshold being a blunt instrument. On the other hand, transfers from unrelated donor groups would render such a region an evolutionary mosaic35. Although gene loss and LGT among the donor lineages may present further complications, TF–IDF seems to promise a first-ever systematic look at the temporal dynamics of superposed transfers.

Summary information from TF–IDF analysis can be collected in the form of an LGT network. For simplicity of interpretation, recipient sequences are subsumed into their respective groups, and inferred transfer events consolidated as weights on the edges. Given the limitations of most clique-finding algorithms, the weights and directionality of edges are usually ignored99. Densely connected regions within an LGT graph – maximum cliques, maximal cliques and paracliques (cliques missing a few edges) – can be extracted using GrAPPA100. These structures demarcate genetic exchange communities (GECs), groups of taxa whose members have shared genetic material among themselves by LGT101. Taxa that retain membership across biologically reasonable values of k are considered core nodes of these GECs99. Other structures in an LGT graph may also be of biological interest, e.g. bridging nodes that connect cliques102,103. By annotating nodes and edges with metadata e.g. on environment, genome type or vector, new perspectives may be gained on the genetic structure of the microbial biosphere, and on genetic flow within and across “independent genetic worlds”102,104-107.
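Extraction of maximal cliques from an LGT network, with edge weight and directionality ignored, can be illustrated with the classical Bron–Kerbosch recursion. This is a generic textbook algorithm substituted here for illustration and is not the GrAPPA implementation; the group labels and the function name are hypothetical, and paracliques are not handled.

```python
def maximal_cliques(edges):
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    adjacency = {}
    for donor, recipient in edges:
        if donor != recipient:
            # Direction is deliberately discarded: store each edge both ways.
            adjacency.setdefault(donor, set()).add(recipient)
            adjacency.setdefault(recipient, set()).add(donor)

    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for node in list(candidates):
            expand(clique | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical inferred transfers as (donor group, recipient group) pairs.
transfers = [("G1", "G2"), ("G2", "G1"), ("G2", "G3"), ("G3", "G1"), ("G3", "G4")]
cliques = maximal_cliques(transfers)
largest = max(cliques, key=len)
```

Here G1, G2 and G3 form the largest clique and would constitute a candidate genetic exchange community, while G3 also bridges to G4. The running time of this recursion can grow exponentially with graph size, which is consistent with the computational demands of clique-finding noted in the text.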

The TF–IDF algorithm is highly scalable, running in O(nL log nL) time where n is the number of sequences and L their average length. Moreover, as the inferred edges are natively lateral and directional, computationally hard steps involving the generation of a reference topology and comparison with test trees are obviated. However, in its current implementation TF–IDF is somewhat greedy of memory, preventing its application to very large datasets. Clique-finding is computationally demanding, even though information on edge directionality and weight is typically ignored.

An analogous approach could be taken to identify regions of vertical inheritance. Wong & Ragan108 recognised core regions that find matches in other sequences, extended these regions using a criterion of mutual exclusivity, built a pairwise similarity graph and applied MCL109 to yield sets of putatively homologous subsequences they called MACHOS. In place of Smith-Waterman, match and extension criteria based on k-mers (above) could equally well be used. MACHOS correspond well to known Pfam domain families108, and offer an AF approach to recognition of orthologs110. It has been argued that workflows in which a gene or protein family is, by default, considered to be inherited vertically unless this null hypothesis is specifically rejected give a conceptually and methodologically unfair advantage to vertical inheritance111,112. Doolittle goes so far as to call this a “false null”111. Parallel AF workflows for lateral and vertical regions could address this objection, inferring “LGT directly, positively and fairly in large genome-scale datasets”99.

2.1.7 Conclusion

The power of AF approaches relies on proper selection of k. The requirement that k-mers be approximately unique to a sequence can be satisfied at a much smaller k for amino acids (alphabet size 20) than for nucleotides (alphabet size 4). For tree inference, optimal k depends on the length and divergence of the sequences, and (more weakly) on the inference method. In our hands, k is optimal at about 3-5 for proteins, and 8-10 for genes or RNAs33,65,73. We set k = 12 for a quick assessment of the relative divergence of microbial genome datasets35,75, while Greenfield & Roehm31 used k > 15 to identify organisms, genes and functions of interest using unique k-mers as tags. For genome trees, optimal k ranged from 8 for isolates of the same bacterial species, up to 25 across bacteria and archaea65,75. Elhai et al.92 used k = 8 to detect genes of recent lateral origin in microbial genomes, but needed to draw on additional lines of evidence to make their approach effective.

AF approaches are beginning to make their mark in phylogenomics and LGT research. K-mers can readily be extracted from sequences, indexed, stored and retrieved. They capture homology signal in evolving sequences, and counts or frequencies of shared k-mers can underpin measures of pairwise distance and the computation of distance trees. Distributions of k-mers among groups of genomes can reveal donor-recipient relationships in LGT, hence communities of genetic exchange, and may be informative on the temporal dynamics of reticulate evolution. Like their MSA-based counterparts, k-mer distance trees can be computed quickly and scale to very large data, without the complications of a substitution model (or multiple models for different sequence regions). Whether or not this is, on balance, a good thing remains to be seen, even apart from the question of whether it makes sense to infer a genome tree113. MSA-based methods have benefitted from more than four decades of development, in the process enriching all their component fields, biological and otherwise. By contrast, AF phylogenomics is still in its infancy. We anticipate that AF methods will mature to provide dependable options in large-scale phylogenomics, while stimulating the exploration of other biological questions previously unimaginable within the classical framework.

2.1.8 References

1 Delsuc, F., Brinkmann, H. & Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375, doi:10.1038/nrg1603 (2005).

2 Eisen, J. A. & Fraser, C. M. Phylogenomics: intersection of evolution and genomics. Science 300, 1706-1707, doi:10.1126/science.1086292 (2003).

3 Pollock, D. D., Eisen, J. A., Doggett, N. A. & Cummings, M. P. A case for evolutionary genomics and the comprehensive examination of sequence biodiversity. Mol. Biol. Evol. 17, 1776-1788 (2000).

4 Sicheritz-Ponten, T. & Andersson, S. G. A phylogenomic approach to microbial evolution. Nucleic Acids Res 29, 545-552 (2001).

5 Ragan, M. A., Bernard, G. & Chan, C. X. Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol 11, 176-185, doi:10.4161/rna.27505 (2014).

6 Margoliash, E. Homology: a definition. Science 163, 127 (1969).

7 Feng, D. F. & Doolittle, R. F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351-360 (1987).

8 Carrillo, H. & Lipman, D. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48, 1073-1082, doi:10.1137/0148063 (1988).

9 Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol. Direct 8, 3, doi:10.1186/1745-6150-8-3 (2013).

10 Fan, H., Ives, A. R., Surget-Groba, Y. & Cannon, C. H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522, doi:10.1186/s12864-015-1647-5 (2015).

11 Darling, A. E., Mau, B. & Perna, N. T. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5, e11147, doi:10.1371/journal.pone.0011147 (2010).

12 Earl, D. et al. Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 24, 2077-2089, doi:10.1101/gr.174920.114 (2014).

13 Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17, 57-86 (1986).

14 Beiko, R. G., Harlow, T. J. & Ragan, M. A. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102, 14332-14337, doi:10.1073/pnas.0504068102 (2005).

15 Gogarten, J. P. & Townsend, J. P. Horizontal gene transfer, genome innovation and evolution. Nat. Rev. Microbiol. 3, 679-687, doi:10.1038/nrmicro1204 (2005).

16 Chan, C. X., Beiko, R. G., Darling, A. E. & Ragan, M. A. Lateral transfer of genes and gene fragments in prokaryotes. Genome Biol. Evol. 1, 429-438, doi:10.1093/gbe/evp044 (2009).

17 Chan, C. X., Darling, A. E., Beiko, R. G. & Ragan, M. A. Are protein domains modules of lateral genetic transfer? PLoS ONE 4, e4524 (2009).

18 Lawrence, J. G. & Ochman, H. Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44, 383-397 (1997).

19 Gogarten, J. P., Doolittle, W. F. & Lawrence, J. G. Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19, 2226-2238 (2002).

20 Skippington, E. & Ragan, M. Within-species lateral genetic transfer and the evolution of transcriptional regulation in Escherichia coli and Shigella. BMC Genomics 12, 532 (2011).

21 Hide, W., Burke, J. & Davison, D. B. Biological evaluation of d2, an algorithm for high-performance sequence comparison. J. Comput. Biol. 1, 199-215, doi:10.1089/cmb.1994.1.199 (1994).

22 Myers, E. W. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345-374, doi:10.1007/bf01185432 (1994).

23 Miller, R. T. et al. A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 9, 1143-1155 (1999).

24 Sameith, K., Roscito, J. G. & Hiller, M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief. Bioinform. 18, 1-8, doi:10.1093/bib/bbw003 (2017).

25 Compeau, P. E., Pevzner, P. A. & Tesler, G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29, 987-991, doi:10.1038/nbt.2023 (2011).

26 Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955-961, doi:10.1038/ng.3331 (2015).

27 Lee, D., Karchin, R. & Beer, M. A. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 21, 2167-2180, doi:10.1101/gr.121905.111 (2011).

28 Haubold, B., Krause, L., Horn, T. & Pfaffelhuber, P. An alignment-free test for recombination. Bioinformatics (Oxford, England) 29, 3121-3127, doi:10.1093/bioinformatics/btt550 (2013).

29 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

30 Vinga, S. & Almeida, J. Alignment-free sequence comparison-a review. Bioinformatics 19, 513-523 (2003).

31 Greenfield, P. & Roehm, U. Answering biological questions by querying k-mer databases. Concurrency and Computation: Practice and Experience 25, 497-509, doi:10.1002/cpe.2938 (2013).

32 Haubold, B. & Pfaffelhuber, P. Alignment-free population genomics: an efficient estimator of sequence diversity. G3 (Bethesda) 2, 883-889, doi:10.1534/g3.112.002527 (2012).

33 Höhl, M., Rigoutsos, I. & Ragan, M. A. Pattern-based phylogenetic distance estimation and tree reconstruction. Evol Bioinform Online 2, 359-375 (2006).

34 Cong, Y., Chan, Y. B. & Ragan, M. A. A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF. Sci. Rep. 6, 30308, doi:10.1038/srep30308 (2016).

35 Cong, Y., Chan, Y. B. & Ragan, M. A. Exploring lateral genetic transfer among microbial genomes using TF-IDF. Sci. Rep. 6, 29319, doi:10.1038/srep29319 (2016).

36 Rigoutsos, I., Huynh, T., Floratos, A., Parida, L. & Platt, D. Dictionary-driven protein annotation. Nucleic Acids Res 30, 3901-3916 (2002).

37 Maetschke, S. R. et al. A visual framework for sequence analysis using n-grams and spectral rearrangement. Bioinformatics (Oxford, England) 26, 737-744, doi:10.1093/bioinformatics/btq042 (2010).

38 Giegerich, R. & Kurtz, S. From Ukkonen to McCreight and Weiner: a unifying view of linear-time suffix tree construction. Algorithmica 19, 331-353, doi:10.1007/pl00009177 (1997).

39 Manber, U. & Myers, G. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935-948, doi:10.1137/0222058 (1993).

40 Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (Oxford, England) 27, 764-770, doi:10.1093/bioinformatics/btr011 (2011).

41 Greenfield, P., Duesing, K., Papanicolaou, A. & Bauer, D. C. Blue: correcting sequencing errors using consensus and context. Bioinformatics (Oxford, England) 30, 2723-2732, doi:10.1093/bioinformatics/btu368 (2014).

42 Melsted, P. & Pritchard, J. K. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12, 333, doi:10.1186/1471-2105-12-333 (2011).

43 Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422-426, doi:10.1145/362686.362692 (1970).

44 Holley, G., Wittler, R. & Stoye, J. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11, 3, doi:10.1186/s13015-016-0066-8 (2016).

45 Abouelhoda, M. I., Kurtz, S. & Ohlebusch, E. in Algorithms in Bioinformatics: Second International Workshop, WABI 2002 Rome, Italy, September 17–21, 2002 Proceedings (eds Roderic Guigó & Dan Gusfield) 449-463 (Springer Berlin Heidelberg, 2002).

46 Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res 27, 2369-2376 (1999).

47 Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics (Oxford, England) 24, 2818-2824, doi:10.1093/bioinformatics/btn548 (2008).

48 Kurtz, S., Narechania, A., Stein, J. C. & Ware, D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 9, 517, doi:10.1186/1471-2164-9-517 (2008).

49 Haubold, B., Pfaffelhuber, P., Domazet-Loso, M. & Wiehe, T. Estimating mutation distances from unaligned genomes. J Comput Biol 16, 1487-1500, doi:10.1089/cmb.2009.0106 (2009).

50 Jun, S. R., Sims, G. E., Wu, G. A. & Kim, S. H. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proc Natl Acad Sci U S A 107, 133-138, doi:10.1073/pnas.0913033107 (2010).

51 Reinert, G., Chew, D., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16, 1615-1634, doi:10.1089/cmb.2009.0198 (2009).

52 Russell, D. J., Way, S. F., Benson, A. K. & Sayood, K. A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences. BMC Bioinformatics 11, 601, doi:10.1186/1471-2105-11-601 (2010).

53 Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, doi:10.1186/1471-2148-9-195 (2009).

54 Bromberg, R., Grishin, N. V. & Otwinowski, Z. Phylogeny reconstruction with alignment-free method that corrects for horizontal gene transfer. PLoS Comput. Biol. 12, e1004985, doi:10.1371/journal.pcbi.1004985 (2016).

55 Goke, J., Schulz, M. H., Lasserre, J. & Vingron, M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics 28, 656-663, doi:10.1093/bioinformatics/bts028 (2012).

56 Leimeister, C. A., Sohrabi-Jahromi, S. & Morgenstern, B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics (Oxford, England), btw776, doi:10.1093/bioinformatics/btw776 (2017).

57 Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 41, e75, doi:10.1093/nar/gkt003 (2013).

58 Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336-350, doi:10.1089/cmb.2006.13.336 (2006).

59 Torney, D. C., Burks, C., Davison, D. & Sirotkin, K. M. in Computers and DNA: the Proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop Vol. 7 (eds G Bell & R. Marr) 109-125 (Addison-Wesley, 1990).

60 Forêt, S., Wilson, S. R. & Burden, C. J. Characterizing the D2 statistic: word matches in biological sequences. Stat. Appl. Genet. Mol. Biol. 8, 43, doi:10.2202/1544-6115.1447 (2009).

61 Lippert, R. A., Huang, H. & Waterman, M. S. Distributional regimes for the number of k-word matches between two random sequences. Proc Natl Acad Sci U S A 99, 13980-13989, doi:10.1073/pnas.202468099 (2002).

62 Song, K. et al. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 15, 343-353, doi:10.1093/bib/bbt067 (2014).

63 Blaisdell, B. E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A 83, 5155-5159 (1986).

64 Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467-1490, doi:10.1089/cmb.2010.0056 (2010).

65 Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci. Rep. 4, 6504 (2014).

66 Burden, C. J., Jing, J. & Wilson, S. R. Alignment-free sequence comparison for biologically realistic sequences of moderate length. Stat. Appl. Genet. Mol. Biol. 11, 3 (2012).

67 Burden, C. J., Leopardi, P. & Foret, S. The distribution of word matches between Markovian sequences with periodic boundary conditions. J. Comput. Biol. 21, 41-63, doi:10.1089/cmb.2012.0277 (2014).

68 Chor, B., Horn, D., Goldman, N., Levy, Y. & Massingham, T. Genomic DNA k-mer spectra: models and modalities. Genome Biol. 10, R108, doi:10.1186/gb-2009-10-10-r108 (2009).

69 Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012 (2013).

70 Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406-425 (1987).

71 Studier, J. A. & Keppler, K. J. A note on the neighbor-joining algorithm of Saitou and Nei. Mol. Biol. Evol. 5, 729-731 (1988).

72 Felsenstein, J. Inferring phylogenies. (Sinauer, 2004).

73 Höhl, M. & Ragan, M. A. Is multiple-sequence alignment required for accurate inference of phylogeny? Syst. Biol. 56, 206-221, doi:10.1080/10635150701294741 (2007).

74 Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586-1591, doi:10.1093/molbev/msm088 (2007).

75 Bernard, G., Chan, C. X. & Ragan, M. A. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Scientific Reports 6, 28970, doi:10.1038/srep28970 (2016).

76 Bernard, G., Ragan, M. & Chan, C. Recapitulating phylogenies using k-mers: from trees to networks [version 2; referees: 2 approved]. F1000Res. 5, 2789 (2016).

77 Beiko, R. G. & Ragan, M. A. Detecting lateral genetic transfer: a phylogenetic approach. Methods Mol. Biol. 452, 457-469, doi:10.1007/978-1-60327-159-2_21 (2008).

78 Chan, C. X., Beiko, R. G. & Ragan, M. A. Scaling up the phylogenetic detection of lateral gene transfer events. Methods Mol. Biol. 1525, 421-432, doi:10.1007/978-1-4939-6622-6_16 (2017).

79 Chan, C. X., Beiko, R. G. & Ragan, M. A. Detecting recombination in evolving nucleotide sequences. BMC Bioinformatics 7, 412, doi:10.1186/1471-2105-7-412 (2006).

80 Clarke, G. D., Beiko, R. G., Ragan, M. A. & Charlebois, R. L. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J. Bacteriol. 184, 2072-2080 (2002).

81 Ragan, M. A. Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev 11, 620-626 (2001).

82 Ragan, M. A. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett. 201, 187-191 (2001).

83 Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF—a simulation framework for genome evolution. Mol. Biol. Evol. 29, 1115-1123, doi:10.1093/molbev/msr268 (2012).

84 Beiko, R. G. & Charlebois, R. L. A simulation test bed for hypotheses of genome evolution. Bioinformatics 23, 825-831, doi:10.1093/bioinformatics/btm024 (2007).

85 Maetschke, S. R., McIntyre, L., Chan, C. X. & Ragan, M. A. LGTNet: fast inference of lateral genetic transfer networks, (2013).

86 Becq, J., Churlaud, C. & Deschavanne, P. A benchmark of parametric methods for horizontal transfers detection. PLoS ONE 5, e9989, doi:10.1371/journal.pone.0009989 (2010).

87 Dufraigne, C., Fertil, B., Lespinats, S., Giron, A. & Deschavanne, P. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res 33, e6, doi:10.1093/nar/gni004 (2005).

88 Garcia-Vallvé, S., Romeu, A. & Palau, J. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10, 1719-1725 (2000).

89 Lawrence, J. G. & Ochman, H. Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95, 9413-9417 (1998).

90 Médigue, C., Rouxel, T., Vigier, P., Henaut, A. & Danchin, A. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222, 851-856 (1991).

91 Ragan, M. A., Harlow, T. J. & Beiko, R. G. Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol. 14, 4-8, doi:10.1016/j.tim.2005.11.004 (2006).

92 Elhai, J., Liu, H. & Taton, A. Detection of horizontal transfer of individual genes by anomalous oligomer frequencies. BMC Genomics 13, 245, doi:10.1186/1471-2164-13-245 (2012).

93 Robinson, P. M. W. & O'Hara, R. J. Cladistic analysis of an Old Norse manuscript tradition. Res. Human. Comput. 4, 115-137 (1996).

94 Howe, C. J. et al. Manuscript evolution. Trends Genet. 17, 147-152 (2001).

95 Marmerola, G. D., Oikawa, M. A., Dias, Z., Goldenstein, S. & Rocha, A. On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS ONE 11, e0167822, doi:10.1371/journal.pone.0167822 (2016).

96 Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106, 2677-2682, doi:10.1073/pnas.0813249106 (2009).

97 Lee, A. R. Numerical revisited: John Griffith, cladistic analysis and St. Augustine's Quaestiones in Heptateuchum. Studia Patristica 20, 24-32 (1989).

98 Ragan, M. A. & Lee, A. R. in The Unity of Evolutionary Biology, Proceedings of Fourth International Congress of Systematic and Evolutionary Biology (ed T. R. Dudley) 432-441 (Dioscorides Press, 1991).

99 Cong, Y., Chan, Y. B., Phillips, C. A., Langston, M. A. & Ragan, M. A. Robust inference of genetic exchange communities from microbial genomes using TF-IDF. Front. Microbiol. 8, 21, doi:10.3389/fmicb.2017.00021 (2017).

100 Lu, Y., Phillips, C. A. & Langston, M. A. GrAPPA. Graph Algorithms Pipeline for Pathway Analysis, (2015).

101 Skippington, E. & Ragan, M. A. Lateral genetic transfer and the construction of genetic exchange communities. FEMS Microbiol. Rev. 35, 707-735, doi:10.1111/j.1574-6976.2010.00261.x (2011).

102 Halary, S., Leigh, J. W., Cheaib, B., Lopez, P. & Bapteste, E. Network analyses structure genetic diversity in independent genetic worlds. Proc Natl Acad Sci U S A 107, 127-132, doi:10.1073/pnas.0908978107 (2010).

103 Liu, W., Pellegrini, M. & Wang, X. Detecting communities based on network topology. Sci. Rep. 4, 5739, doi:10.1038/srep05739 (2014).

104 Dagan, T. & Martin, W. Getting a better picture of microbial evolution en route to a network of genomes. Philos. Trans. R. Soc. Lond. B Biol. Sci. 364, 2187-2196, doi:10.1098/rstb.2009.0040 (2009).

105 Fondi, M. & Fani, R. The horizontal flow of the plasmid resistome: clues from inter-generic similarity networks. Environ Microbiol 12, 3228-3242, doi:10.1111/j.1462-2920.2010.02295.x (2010).

106 Koonin, E. V. The turbulent network dynamics of microbial evolution and the statistical Tree of Life. J. Mol. Evol. 80, 244-250, doi:10.1007/s00239-015-9679-7 (2015).

107 Puigbò, P., Wolf, Y. I. & Koonin, E. V. The tree and net components of prokaryote evolution. Genome Biol Evol 2, 745-756, doi:10.1093/gbe/evq062 (2010).

108 Wong, S. & Ragan, M. A. MACHOS: Markov clusters of homologous subsequences. Bioinformatics (Oxford, England) 24, i77-i85, doi:10.1093/bioinformatics/btn144 (2008).

109 Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30, 1575-1584 (2002).

110 Shin, C. J., Davis, M. J. & Ragan, M. A. Towards the mammalian interactome: Inference of a core mammalian interaction set in mouse. Proteomics 9, 5256-5266, doi:10.1002/pmic.200900262 (2009).

111 Doolittle, W. F. The practice of classification and the theory of evolution, and what the demise of Charles Darwin's tree of life hypothesis means for both of them. Philos. Trans. R. Soc. Lond. B Biol. Sci. 364, 2221-2228, doi:10.1098/rstb.2009.0032 (2009).

112 Doolittle, W. F. & Bapteste, E. Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci U S A 104, 2043-2049, doi:10.1073/pnas.0610699104 (2007).

113 Doolittle, W. F. Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Response from Doolittle. Science, 1443a (1999).

CHAPTER 3: PHYLOGENETIC INFERENCE OF MOLECULAR SEQUENCES USING ALIGNMENT-FREE APPROACHES

The potential of AF approaches in accurate phylogenetic inference is little known. In this Chapter, I evaluate the performance of AF approaches to infer accurate phylogenies under diverse evolutionary scenarios, using both simulated and empirical sequence data from bacteria and archaea.

This Chapter is presented in the form of two manuscripts, addressing Aims 1 and 3 (Chapter 1, section 1.5). The first manuscript presents proof-of-concept that computational AF approaches can produce results that are identical to the earlier findings by Carl Woese and colleagues, which were based on short oligonucleotides of the universal phylogenetic markers, 16S and 18S rRNAs. This work was published in the journal RNA Biology (doi: 10.4161/rna.27505) and edited for this thesis. As the second author of this paper, I designed the framework, conducted all the analyses, prepared the figures and tables, and helped write and edit the manuscript.

The second manuscript includes a study that assessed the sensitivity and robustness of alignment-free approaches at the gene level to various biological scenarios in molecular evolution, namely sequence divergence, among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions. This work was published in Scientific Reports (doi: 10.1038/srep06504) and edited for this thesis. As the second author of this paper, I designed the framework, conducted all the analyses on the simulated data and helped write and edit the manuscript. The supplementary material for this manuscript is presented in Appendix A.

3.1 Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra

3.1.1 Abstract

From 1971 to 1985, Carl Woese and colleagues generated oligonucleotide catalogs of 16S/18S rRNAs from more than 400 organisms. Using these incomplete and imperfect data, Carl and his colleagues developed unprecedented insights into the structure, function, and evolution of the large RNA components of the translational apparatus. They recognized a third domain of life, revealed the phylogenetic backbone of bacteria (and its limitations), delineated taxa, and explored the tempo and mode of microbial evolution. For these discoveries to have stood the test of time, oligonucleotide catalogs must carry significant phylogenetic signal; they thus bear re-examination in view of the current interest in alignment-free phylogenetics based on k-mers. Here we consider the aims, successes, and limitations of this early phase of molecular phylogenetics. We computationally generate oligonucleotide sets (e-catalogs) from 16S/18S rRNA sequences, calculate pairwise distances between them based on D2 statistics, compute distance trees, and compare their performance against alignment-based and k-mer trees. Although the catalogs themselves were superseded by full-length sequences, this stage in the development of computational molecular biology remains instructive for us today.

3.1.2 Introduction

A basic goal of biology is to account for the evolution of the cell. Emergence of the translation apparatus is the single most important event in this evolution, for capacity to translate is what defines genotype and phenotype1.

From our vantage point, informed as we are by petabytes of sequence and structural data from all manner of organisms, it is easy to forget how little was known of the molecular basis of life when Carl Woese began his career in research. Carl was awarded his PhD in 1953, the year Watson and Crick published the double-helical structure of DNA. At about the same time Keller et al.2 localized protein synthesis to a “microsomal” fraction of the cell, within which RNA-rich particles, later termed ribosomes, were soon discovered3. In 1959–1961 two large RNA molecules were revealed as components of the ribosome—one sedimenting at 16S, the other at 23S4.

In early publications, Carl described how DNA, RNA, and microsomal fractions behaved during the germination of bacterial spores5,6. In late 1960 he initiated research on the genetic code, and over the next few years made fundamental contributions to our understanding of its origin, universality, and specificity. He was among the first to consider translation in an explicitly evolutionary perspective7-9 and emphasized the role of RNA; for example, in refocusing the basis of genetic code specificity away from steric interactions among amino acids: “in an important sense, the codon ‘chooses’ its amino acid, not the reverse”10.

Through these early years, the structure of RNAs remained unclear; indeed, not until the early 1960s was it established that RNAs were linear polymers, i.e., can be referred to as having a sequence. In the early 1950s, Fred Sanger and collaborators had developed a stepwise experimental strategy to reveal the structure of insulin as a sequence of amino acids. Each chain was enzymatically cleaved into oligopeptides; these were separated and laboriously characterized, and from the fragmentary sequences large portions of the original protein sequence were reassembled11. By about 1960 it was becoming clear that protein sequences were non-random and contained regions with different degrees of conservation.

A similar strategy was soon applied to elucidate the structure of some viral RNAs. RNAs were digested with pancreatic ribonuclease and the products separated using chromatography and electrophoresis, yielding mono-, di-, and tri-nucleotides consistent with an unbranched linear sequence of ribonucleotides linked by phosphodiester bonds12. Ten of these products, with lengths from one to four nucleotides, could be readily identified based on their electrophoretic mobility alone, while others were identified via a combination of strategies13. Differences in the relative abundance of dinucleotides were interpreted as demonstrating differences in “the sequential arrangement of nucleotides” that were imagined to underlie biological differences among viruses14.

The Sanger protocol was later modified to introduce an initial digestion with ribonuclease T1, thereby generating only a single product from G15. With the separation technology then available, about 40 oligomers in the length range from one to five nucleotides could be resolved; this was not sufficient to distinguish Escherichia coli 16S from 23S rRNA, although in due course methodological improvements provided access to higher oligomers16,17. Sequences of tRNA18 and 5S rRNA19 yielded to other protocols that generated overlapping fragments; but in 1964, when Carl took up an appointment to the faculty of the University of Illinois, no RNA molecule had been fully sequenced.

3.1.2.1 The evolutionary approach to conserved structure and function

Ideas of interrelationships among structure, function, and evolution run deep in the history of biology. Karl von Baer, the French transcendental morphologists and others variously glimpsed parts of this nexus, albeit from pre-Darwinian perspectives and with a mechanical interpretation of function. The appearance of protein sequences in the early 1960s brought renewed interest in relationships among ancestry, conserved and variable regions of sequence and structure, and molecular function19-27. In a 1969 letter to Francis Crick28, Carl referred to this history embedded in molecules as the cell’s “internal fossil record.”

Carl understood that a comparative approach would likewise reveal which regions of RNAs were conserved, hence, functionally important. Already in 1961 he had compared nucleotide compositions of 16S and 23S rRNA fractions in different bacteria29 and in the 1969 letter to Crick he wrote of his “important and nearly irreversible decision” to “determine primary structures for a number of genes in a very diverse group of organisms, on the hope that by deducing rather ancient ancestor sequences for these genes, one will eventually be in the position of being able to see features of the cell’s evolution. The obvious choice of molecules here lies in the components of the translation apparatus. What more ancient lineages are there?”28.

Carl directed some effort to 5S rRNA30 and 23S rRNA31 but his main focus was on 16S rRNA. Beginning in 1971, Carl and his coworkers at Illinois, and in due course collaborators in Halifax and Munich, generated oligonucleotide catalogs for 16S rRNAs from about 400 organisms32,33. Their comparative approach quickly bore fruit, with the observation that the sets of oligonucleotides from Escherichia coli and Bacillus megaterium 16S rRNAs were much more similar than expected by chance:

“It is important to explain the existence of sequence homology between these two 16S rRNA species. If it reflects the fact that certain portions of their common ancestral primary structure are locked into the present sequences due to stringent constraints imposed by structural and/or functional considerations, then the conservation becomes highly significant. However, were the frequency of occurrence of mutations in rRNA cistrons to be sufficiently low for some reason, then the bulk of the observed conservation could merely reflect the fact that mutations had not occurred in those regions in either organism, and conservation would be of trivial significance.”1

The second, alternative hypothesis is amenable to experiment, and their comparison with a third 16S rRNA, represented by a partial catalog from Alcaligenes faecalis, and with the “unrelated” 14S rRNA of Rhodopseudomonas spheroides and the 18S rRNA of yeast, may have constituted the first validation in computational molecular biology. Although not a proof, additional sequences could be brought into the comparison until the argument for homology becomes undeniable.

Pechman and Woese1 also concluded that “(i)n a molecule as large as the 16S rRNA, all residues are clearly not equivalent in their importance to molecular function.” Some residues are neutral and would be replaced quickly on an evolutionary timescale, whereas others are functionally constrained such that their replacement would have to be compensated by a “more or less simultaneous” change of other residues. The more deeply such a “replacement unit” is entangled into molecular function, the longer its mutational “half-life,” and the more informative it might be on basal features in the tree. 16S/18S rRNA was a “compound, non-linear chronometer”34 whose broad-range applicability arises not from its size per se, but rather because each of its more-or-less independent structural domains embeds covariance sets that inform on different scales of evolutionary time, much as the hands of a clock separately indicate hours, minutes, and seconds35.

For Carl, the “ultimate goal in comparative studies of rRNA sequence is to construct a chronometric model of the molecule that permits its potential as an evolutionary measuring device to be fully exploited.”36 He formalized this deeply structural (i.e., not purely statistical or cladistic) concept as covariance sets of nucleotides. In due course, sets of co-varying positions would be mapped onto folded structure; but in the meantime, the path to covariance sets lay through oligonucleotide catalogs and signature analysis.

3.1.2.2 Oligonucleotide catalogs

In this context, a catalog is the list of oligomers identified following enzymatic digestion of an RNA or protein. Complete digestion of an RNA with T1 ribonuclease15 yields non-overlapping oligonucleotides that end in G. Although at first only short oligonucleotides could be resolved and identified, by the mid-1970s the upper limit on length had been pushed well into the teens, and in one case to 2437. Incompletely characterized oligomers, those with modified bases, and termini, were often included in these catalogs; short oligomers (for 16S rRNAs, 5-mers and below) contributed no additional resolving power, and were often not reported.
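The digestion step described above can be mimicked in silico. The following is a minimal Python sketch under stated assumptions, not the procedure used in this work: the function names `t1_digest` and `catalog` are hypothetical, the default length cutoff of 6 follows the remark that 5-mers and below were often not reported, and the 3'-terminal fragment is kept even when it does not end in G. Modified bases are ignored.

```python
from collections import Counter


def t1_digest(rna: str) -> list[str]:
    """Cleave an RNA string after every G (idealised complete T1 digestion)."""
    fragments = []
    current = []
    for base in rna.upper():
        current.append(base)
        if base == "G":
            # T1 cuts 3' of G, so every internal fragment ends in G.
            fragments.append("".join(current))
            current = []
    if current:
        # Assumption: keep the 3'-terminal fragment even if it lacks a final G.
        fragments.append("".join(current))
    return fragments


def catalog(rna: str, min_len: int = 6) -> Counter:
    """Return the multiset of digestion products of length >= min_len."""
    return Counter(f for f in t1_digest(rna) if len(f) >= min_len)
```

Because the fragments do not overlap, concatenating the output of `t1_digest` recovers the input sequence, whereas the catalog itself discards fragment order and short products.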

An RNA dinucleotide catalog was presented by Reddi14 and catalogs with larger oligonucleotides were published by Rushizky and Knight38, Sanger15, and others. More than 30 16S/18S rRNAs had been oligo-cataloged by 197539, more than 170 by 198040, and more than 400 by 198536. Most of these data were transferred to punch cards17 and organized as a database with search, comparison, and tree-inference tools41.

3.1.2.3 Comparing catalogs and computing trees

Sydney Fox and Paul Homeyer42 compared partial amino acid compositions in seed globulins of six plants43, and in 24 protein types mostly from animals (Table 1 of ref. 44). The idea of combinatorially based diversity can be discerned in their publication, but Fox and Homeyer did not discuss sequences per se. Importantly, however, they interpreted these composition data as showing that “protein synthesis has not, in the main, yet become sufficiently diverse through molecular evolution to yield substantially unrelated proteins.” In modern terminology, protein structure as reflected in 1-mers did not seem to be evolving so fast that historical signal would be lost. This had not been shown before, and set the scene for the subsequent development of molecular phylogenetics.

As we mention above in the context of primary-structural determination, peptides from protease digests could be separated in two dimensions by paper chromatography and electrophoresis, and compared by eye for similarities and differences22,45,46. František Šorm and colleagues47,48 were arguably the first to use patterns and frequencies of di-, tri-, and tetra-peptides not only to explore regularities in protein structure, but also to compare “proteins which have the same function but differ in their origin (different animal species), and proteins of similar function and a common origin.”48 As summarized by Williams et al.49, Šorm thought that his work demonstrated that “even where complete sequences are not known, the number of peptides common to two proteins can be used to show similarity of their primary structures.”

New techniques were needed to compare sets of oligomers. Two sequences might be compared by eye (e.g., ref. 50), but this is neither scalable nor statistically rigorous. Citing a standard statistical text51, Carl selected for this purpose the binary association coefficient (SAB): twice the sum of nucleotides in oligonucleotides common to a pair of catalogs, divided by the total number of nucleotides in the two catalogs52. Short oligomers were omitted, and no background correction was made (see ref. 53). Carl was nonetheless distrustful about comparing catalogs in this (or any other automated) way: the oligonucleotide data were biased (ribonuclease T1 does not cleave randomly, and electrophoresis at low pH separates some oligonucleotides more cleanly than it does others), and families of similar, probably homologous, oligonucleotides were mostly ignored.
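The verbal definition of SAB translates directly into a few lines of code. The sketch below is illustrative only: the function name `s_ab` and the default cutoff of 6 nucleotides are assumptions, as is the treatment of an oligonucleotide that occurs more than once in a catalog (counted here by multiset intersection). No background correction is applied, in keeping with the description above.

```python
from collections import Counter


def s_ab(catalog_a, catalog_b, min_len=6):
    """Binary association coefficient between two oligonucleotide catalogs.

    Each catalog is an iterable of oligonucleotide strings. Oligomers shorter
    than min_len are omitted before the comparison.
    """
    a = Counter(o for o in catalog_a if len(o) >= min_len)
    b = Counter(o for o in catalog_b if len(o) >= min_len)

    # Assumption: repeated oligomers are matched as a multiset intersection.
    common = a & b
    shared_nt = sum(len(o) * n for o, n in common.items())

    total_nt = sum(len(o) * n for o, n in a.items())
    total_nt += sum(len(o) * n for o, n in b.items())

    # Twice the nucleotides in common oligomers over all nucleotides in both.
    return 2 * shared_nt / total_nt if total_nt else 0.0
```

For example, two catalogs that share one 6-mer and otherwise hold a 7-mer and an 8-mer give 2 × 6 / (13 + 14) = 12/27. Identical catalogs score 1 and catalogs with no common oligomer score 0.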

But more fundamentally for Carl, SAB values could not capture molecular structure.

Later, when full sequences had become available, Carl plotted pairwise SAB values between catalogs against percent similarity of aligned 16S rRNA sequences, revealing an imprecise relationship for SAB less than about 0.40, i.e., most of them35. Carl criticized his earlier catalog approach as (1) not having resolved branching orders among major bacterial divisions and subdivisions, and (2) failing to resolve branching order of rapidly evolving lineages such as the . Catalogs and pairwise SAB values could not offer the resolving power that was available from the rRNA chronometer as read via sequences; nor should we “consider the second hand when timing the seasons.”35

Given a matrix of pairwise SAB values, a dendrogram could be computed by average linkage clustering. The first rRNA oligonucleotide trees appeared in 197653 and 197737,54. Fox et al.37 asserted that although this approach is phenetic, “it is clear from the molecular nature of the data” that the topology “would closely resemble, if not be identical to, that of a phylogenetic tree based upon such ancestral catalogs.” These trees might be a guide to relatedness and relative antiquity (e.g., ref. 40), but Carl did not delineate taxa solely on the basis of trees36.
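Average linkage clustering on a similarity matrix can be sketched as follows. This is a generic, unoptimised illustration and not a reconstruction of the software used at the time: the function name `average_linkage`, the nested-tuple representation of the dendrogram, and the dictionary keyed by `frozenset` pairs are all assumptions made for brevity. Branch heights are not recorded.

```python
from itertools import combinations


def average_linkage(names, sim):
    """Build a dendrogram from pairwise similarities by average linkage.

    names: list of taxon labels.
    sim:   dict mapping frozenset({a, b}) to the similarity (e.g. SAB) of a, b.
    Returns the dendrogram as nested 2-tuples of labels.
    """
    # Each cluster is (subtree, list of member labels).
    clusters = [(name, [name]) for name in names]

    def avg_sim(c1, c2):
        # Mean similarity over all pairs of original taxa, one from each cluster.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        ci, cj = clusters[i], clusters[j]
        merged = ((ci[0], cj[0]), ci[1] + cj[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return clusters[0][0]
```

The procedure groups taxa by overall similarity alone, which is why the resulting dendrogram is a guide to relatedness rather than an explicit model of ancestry.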

3.1.2.4 Signatures

More important than trees was the “internal fossil record” revealed through signatures. Carl defined a signature as a “set of oligonucleotides that is characteristic of (unique to) a group of organisms,” but immediately relaxed this to allow oligonucleotides to “occur in half or more of the members of the group, but are either not found in other organisms or occur only sporadically therein.”55 Slightly different formulations were offered later35,36. Modulo this relaxation, signatures were synapomorphies (Carl Woese, personal communication to MAR, 30 August 1988).
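The relaxed definition of a signature can be expressed as a simple filter over catalogs. The sketch below is only a crude approximation of what was, for Carl, a structural and expert-driven analysis: the function name `signature` is hypothetical, and the numerical threshold standing in for “only sporadically” (here 10% of organisms outside the group) is an assumption with no basis in the source.

```python
def signature(group_catalogs, other_catalogs, min_in_group=0.5, max_outside=0.1):
    """Return oligonucleotides characteristic of a group of organisms.

    group_catalogs, other_catalogs: lists of catalogs (iterables of strings),
    one per organism. An oligonucleotide is kept if it occurs in at least
    min_in_group of the group and in at most max_outside of the others.
    """
    group_sets = [set(c) for c in group_catalogs]
    other_sets = [set(c) for c in other_catalogs]
    candidates = set().union(*group_sets)

    result = set()
    for oligo in candidates:
        in_frac = sum(oligo in s for s in group_sets) / len(group_sets)
        if other_sets:
            out_frac = sum(oligo in s for s in other_sets) / len(other_sets)
        else:
            out_frac = 0.0
        if in_frac >= min_in_group and out_frac <= max_outside:
            result.add(oligo)
    return result
```

Setting `min_in_group=1.0` and `max_outside=0.0` recovers the strict sense of a set of oligonucleotides unique to the group.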

Carl immersed himself in the details. As related by George Fox, during the heyday of the oligonucleotide work “Carl had established routines that allowed him to be with the fingerprints 8 hours a day, 5 days a week. He went to great lengths to avoid interruptions and non-research related activities.”17 Carl’s knowledge of patterns of oligonucleotide occurrence and co-variation, and his ability to map details immediately onto folded structure, convinced one of us (MAR) that he had an exquisitely detailed mental map of 16S rRNA structure and evolution, as Emanuel Margoliash surely had for cytochrome c. In any case, to Carl a signature was a deeply structural and chronometric construct36, not to be entrusted to generic (or even purpose-built) software.

Carl’s group managed and compared signatures, and computed trees, with the aid of mainframe computing. Tom Macke wrote a program “sig” that could map the distribution of oligonucleotides, including degenerate ones, across a set of catalogs36,56,57. Similar programs are mentioned by Sobieski et al.41 In those years, hardware and operating systems were far less standardized than today, and it was not straightforward to exchange programs, much less to offer remote access.

All these factors conspired to make signature analysis à la Woese somewhat opaque to outsiders, including the numerical taxonomy and cladistics communities. Zablen et al.39 clearly articulate the value of shared derived characters; Fox et al.52 describe an approach seemingly inspired by parsimony; and Carl mentions parsimony analyses elsewhere in passing, e.g., reference 35. Once 16S rRNA structures became available58,59, Carl mapped these signatures onto folded structure.

Taxa could at last be recognized by three criteria (page 236 of ref. 35): coherence by SAB, shared sequence signature, and higher-order molecular structure.

3.1.2.5 Oligonucleotides and k-mers

Sequences or regions thereof can be arranged relative to each other to reveal similarities and differences; the term “alignment” was introduced for such operations in 196060, although the concept has deeper roots in genetics, computer science, and other fields. Peptides and proteins were aligned first, then tRNAs in 196661 and 5S rRNA in 197150. These early alignments were based on visual inspection, but as the comparison problem began to be described more precisely for analysis using electronic computers, three not unrelated classes of approaches emerged62. In today’s terminology these are the sliding-window, dot-matrix, and k-mer spectrum approaches.

Dot-matrix methods were prefigured by Walter Fitch63 and others prior to their formal description by Gibbs and MacIntyre64. Adrian Gibbs and colleagues considered the dot-matrix to subsume the sliding-window approach62, and to be “similar in principle”64 to a method explained by Saul Needleman to Fitch65 in 1965 and later introduced as the first algorithm for full-length sequence alignment66. Applied to molecular sequences, all these approaches find regions of local identity (or similarity). Like oligonucleotides matched between two catalogs, these local regions are not of predefined length; rather, their frequency spectrum (number at each increment of length) is determined by the degree and pattern of pairwise sequence similarity, and by data quality.

Alternatively, sequence analysis can be approached using a fixed word length. In the BLAST algorithm67, for example, the query sequence is hashed into regions of predetermined length. Similar operations are encountered in diverse areas of mathematics, computer science, and information theory, e.g., for sequence compression, indexing, or retrieval. Reflecting these diverse origins and applications, short perfectly matched strings of predetermined length are variously termed k-mers, words, or n-grams. A common thread is that these strings provide a fast approach to detecting a signal of similarity. K-mers find utility in many areas of genomics including genome size estimation, assembly, clustering, and studies on sequence periodicity and lateral genetic transfer68,69.
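As a minimal sketch of the fixed-word-length idea (not code from any of the studies cited here), the following Python extracts all overlapping k-mers from a sequence and reports those shared by two sequences. The function names `kmer_counts` and `shared_kmers` are illustrative assumptions.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in `sequence`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmers(seq_a: str, seq_b: str, k: int) -> set:
    """Return the set of k-mers that occur in both sequences."""
    return set(kmer_counts(seq_a, k)) & set(kmer_counts(seq_b, k))
```

A sequence of length n contributes n - k + 1 overlapping k-mers, so the count table grows linearly with sequence length whatever the value of k.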

In molecular phylogenetics, k-mers have long been used to capture phylogenetic signal. Gibbs et al.62 used dipeptide frequencies (k = 2) to compute phylogenetic trees based on sequences of cytochromes c, hemoglobins, and other proteins. Blaisdell70 did likewise for a broader set of proteins, with k = 2 and k = 3. More recently, tree inference has used values of k in the range 3–5 for proteins71,72, and longer k has been proposed for nucleotides73.

Below we look back on Carl’s oligonucleotide catalogs as a source of data for phylogenetic inference. With the benefit of complete 16S/18S rRNA sequences, we ask about the accuracy and coverage of T1 oligonucleotide catalogs, and compare Carl’s clustering diagrams with trees based on multiple alignment of complete sequences and inference methods. Because most original T1 catalogs are no longer accessible in an electronic format, we computationally reconstruct e-catalogs from full-length rRNA sequences of the 13 organisms examined by Woese and Fox74, compare them with selected empirical catalogs, calculate $D_2^S$ statistics75-78, and compute a neighbor-joining (NJ) tree79. We then do the same for a more complete set of bacteria40. Thereafter, we extract k-mers (at different values of k) from the full-length sequences, and again calculate $D_2^S$ statistics and compute NJ trees. This allows us to explore similarities and differences between oligonucleotide catalogs and modern k-mer spectra in phylogenetics.

3.1.3 Results

3.1.3.1 Trees from aligned full-length sequences

As a reference topology, we inferred a tree based on full-length 16S/18S rRNA sequences of the 13 organisms in Woese and Fox74 or very close relatives. Multiple sequence alignment (i.e., not leveraging the folded structure of rRNA) followed by fast maximum-likelihood (Fig. 1A) or Bayesian inference (Fig. 1B) yielded trees differing from each other in two respects: the position of the cyanobacterial/chloroplast subtree within the bacteria, and branching order within eukaryotes. These disagreements correspond to very short internal edges and poor bootstrap support in the likelihood tree (Fig. 1A). We followed the same approach to infer trees from aligned full-length 16S rRNA sequences from eight proteobacteria (Fig. 4 of ref. 40), with a Synechocystis rRNA as outgroup (Fig. 2A and B).


Figure 1. Trees for 16S/18S rRNAs in the three-kingdom data set74 inferred via multiple sequence alignment of full-length rRNAs using MUSCLE and (A) RAxML or (B) MRBAYES; (C) computed via neighbor-joining from the similarity matrix in reference 74; (D) calculated via $D_2^S$ and neighbor-joining from our e-catalogs; and calculated via $D_2^S$ and neighbor-joining from k-mer spectra at (E) k = 6, (F) k = 8, (G) k = 12, or (H) k = 16. To facilitate comparison, all trees were rooted similarly (arbitrarily on archaea), except for (C) in which trees were rooted independently on archaea (left), bacteria (middle), and eukaryotes (right).


Figure 2. Trees for 16S rRNA in the proteobacterial data set40 inferred via multiple sequence alignment of full-length rRNAs using MUSCLE and (A) RAxML or (B) MRBAYES; (C) calculated via $D_2^S$ and neighbor-joining from our e-catalogs; and calculated via $D_2^S$ and neighbor-joining from k-mer spectra at (D) k = 6, (E) k = 8, (F) k = 12, or (G) k = 16. To facilitate comparison, all trees were rooted similarly on the 16S rRNA of the cyanobacterium Synechocystis.

3.1.3.2 Trees from published SAB matrices

Woese and Fox74 present a matrix of pairwise association coefficients (SAB) between oligonucleotide catalogs (length ≥ 6), but do not depict the tree these data imply. We converted these SAB values to distances, and computed the NJ tree (Fig. 1C). Rooted at any point outside the three clusters of sequences, this tree clearly reveals three main lines of descent. Woese and Fox74 do not treat branching structure within each kingdom, but the topology we reconstruct within the bacterial lineage is congruent with the cluster diagram published at about the same time by Balch et al.54 Later, with data from additional bacteria, Chlorobium assumed a more-basal position40. However, Synechocystis and the Lemna chloroplast appear paraphyletic, as do Methanobrevibacter and Methanothermobacter among Archaea.
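The conversion and tree-building step can be sketched as follows. This is an illustrative Python implementation of the neighbor-joining algorithm of Saitou and Nei79, not the program used for Fig. 1C. The conversion `1 - SAB` in `sab_to_distance` is an assumption made purely for illustration, since the text does not state which transformation was applied, and all function names are hypothetical.

```python
from itertools import combinations


def sab_to_distance(sab: float) -> float:
    """ASSUMPTION: treat 1 - SAB as a dissimilarity; the source does not give its formula."""
    return 1.0 - sab


def neighbor_joining(names, dist):
    """Return an unrooted NJ tree in Newick format.

    names: list of taxon labels (at least three).
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    if len(names) < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    d = dict(dist)
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # Join the pair that minimises the Q criterion.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ab = d[frozenset((a, b))]
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        rest = [c for c in nodes if c not in (a, b)]
        # Distances from the new internal node to every remaining node.
        for c in rest:
            d[frozenset((new, c))] = 0.5 * (
                d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab)
        nodes = rest + [new]
    # Resolve the last three nodes around a single central node.
    x, y, z = nodes
    dxy, dxz, dyz = (d[frozenset(p)] for p in ((x, y), (x, z), (y, z)))
    lx = 0.5 * (dxy + dxz - dyz)
    ly = 0.5 * (dxy + dyz - dxz)
    lz = 0.5 * (dxz + dyz - dxy)
    return f"({x}:{lx:.4f},{y}:{ly:.4f},{z}:{lz:.4f});"
```

Because the result is unrooted, a root (for example on archaea) has to be chosen afterwards, as was done for the trees compared in this chapter.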

3.1.3.3 Computational generation of e-catalogs

We had hoped to generate trees from the original oligonucleotide catalog data underlying Woese and Fox74 but were able to access only six of the 13 catalogs, and part of a seventh (George Fox has more recently recovered others for us). So instead, starting with full-length 16S/18S rRNA sequences from the same or very closely related organisms (Table 1), we computationally generated sets of oligonucleotides, mimicking digestion with ribonuclease T1. Fragments at the 5′ and 3′ termini were included, and oligomers of length < 6 were removed. We refer to these sets as e-catalogs.

Table 1. All 16S/18S rRNA sequences used in this study, their GenBank accession numbers, and their inclusion in our re-analysis of rRNAs from (A) three kingdoms74 and (B) proteobacteria (Fig. 4 of ref. 40). For proteobacteria in our analysis B, we identify class (alpha-, beta-, gamma-, or delta-proteobacteria).

Source organism                            GenBank accession    Analysis
Mus musculus                               X00686.1             A
Saccharomyces cerevisiae                   V01335.1             A
Spathiphyllum wallisii                     AF207023.1           A
Methanobacterium ruminantium               NR_074117.1          A
Methanoculleus marisnigri                  NR_074174.1          A
Methanosarcina barkeri                     NR_074253.1          A
Methanothermobacter thermautotrophicus     NR_074260.1          A
Bacillus firmus                            JQ282815             A
Chlorobium vibrioforme                     M62791               A
Corynebacterium diphtheriae                NR_103937.1          A
Lemna minor chloroplast                    NC_010109.1*         A
Synechocystis sp.                          NR_074311.1          A, B
Rhodobacter sphaeroides (alpha)            NR_029215.1          B
Rhodospirillum rubrum (alpha)              NR_074249.1          B
Rhizobium leguminosarum (alpha)            D14513.1             B
Alcaligenes faecalis (beta)                AF155147.1           B
Desulfovibrio desulfuricans (delta)        NR_036778.1          B
Escherichia coli (gamma)                   NR_102804.1          A, B
Yersinia pestis (gamma)                    NR_074199.1          B
Pseudomonas aeruginosa (gamma)             NR_074828.1          B

*, positions 106162–107648.

3.1.3.4 Comparison of empirical and e-catalogs

To determine the extent to which our e-catalogs recapitulate Carl’s empirical T1 catalogs (and can thus stand in for the latter in tree inference), we compared e-catalogs and original T1 catalogs for Escherichia coli and Methanobacterium ruminantium M-1 (later renamed Methanobrevibacter ruminantium M1). For the purpose of this comparison, we ignored base modifications (e.g., treated A* as identical to A) and copy number, and resolved ambiguities in the empirical data in favor of a match. Table 2 demonstrates that our e-catalogs recapitulate the empirical oligonucleotides very well, although not perfectly. Mismatches likely arise due to weak, diffuse (e.g., Figure 1 of ref. 52), or incompletely resolved spots on paper electrophoresis (e.g., Figure 1 of ref. 16), although sequencing errors, covalent modifications, and/or strain differences cannot be ruled out. It is clear from Table 3 that the landmark recognition of three kingdoms74, and molecular-systematic studies on numerous groups of bacteria and archaea, were based on data representing fewer than 40% of the positions in the 16S rRNA. This is less worrisome than might be thought; many of these oligonucleotides map to one side of a helical region, such that much of the “missing” information is in fact represented as the reverse complement (see Figure 2 of ref. 17).
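The comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: modified bases are assumed to be marked with a trailing `*`, catalogs are assumed to be plain lists of oligonucleotide strings, and the favorable resolution of ambiguous positions is not implemented. The function names are hypothetical.

```python
def normalise(oligo: str) -> str:
    """Ignore base modifications by dropping the '*' marker (A* is read as A)."""
    return oligo.replace("*", "").upper()


def matching_oligos(empirical, e_catalog) -> set:
    """Unique oligonucleotides present in both catalogs, ignoring copy number."""
    return {normalise(o) for o in empirical} & {normalise(o) for o in e_catalog}


def coverage_percent(catalog, sequence_length: int) -> float:
    """Percentage of sequence positions accounted for by the catalog.

    Repeated oligonucleotides are counted each time they are listed, and the
    fragments are assumed not to overlap (as for a complete T1 digest).
    """
    covered = sum(len(normalise(o)) for o in catalog)
    return 100.0 * covered / sequence_length
```

With the counts reported in Table 3, `round(100.0 * 584 / 1542, 1)` gives 37.9 for the empirical E. coli catalog, in line with the statement that fewer than 40% of positions are represented.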

Table 2. Numbers of unique oligonucleotides in empirical 16S rRNA catalogs, and of k-mers in e-catalogs.

Oligomer         Escherichia coli                    Methanobacterium ruminantium
length or k      empirical   e-catalog   match       empirical   e-catalog   match
6 (a)            21 (b)      21          21          22          22          20
7                17          16          16          15          16          13
8                10          11          10          14          15          13
9                13          12          12          10          9           8
≥10              11          13          10          11          12          10
Total            72          73          69          72          74          64

Escherichia coli empirical catalog from Uchida et al.16 as corrected by Magrum et al.,88 and Methanobacterium ruminantium M-1 (later renamed Methanobrevibacter ruminantium M1) empirical catalog from Fox et al.37 For the calculation of matching, modifications of bases are ignored and ambiguities are resolved favourably. (a) Includes the 5′ terminus. (b) Uchida et al.16 report one 6-mer sequence twice, once as unmodified and once as modified; for the purposes of this table we count them once.

Table 3. Nucleotide coverage of full-length 16S rRNA sequence by oligonucleotides in empirical catalogs, and k-mers in e-catalogs, of Escherichia coli and Methanobacterium ruminantium M-1 (Methanobrevibacter ruminantium M1).

16S rRNA source     Number (empirical)   Coverage (%)   Number (e-catalog)   Coverage (%)
E. coli             584/1542             37.9           590/1542             38.3
M. ruminantium      572/1436             39.8           601/1436             41.9

For catalogs, see Supplemental Data 1. Multiple (non-unique) instances are counted (note that Fox et al.37 do not report multiple occurrences, which in any case were rare for oligonucleotides ≥ 6). Full-length sequences are NR_102904.1 and NR_074117.1, respectively.

3.1.3.5 Trees from e-catalogs

From the e-catalogs we calculated pairwise distances via the $D_2^S$ statistic (Materials and Methods), and computed an NJ tree (Fig. 1D). This tree shows the three-kingdom structure. Topology within the archaeal (methanogen) subtree agrees with that in Fox et al.37,40 and with our k-mer trees (Fig. 1E–H, for which see below); for simplicity we call this the 2M+2M topology within Archaea. Among bacteria, the Synechocystis-chloroplast and Bacillus-Corynebacterium pairs seen in the alignment-based trees are apparent here too, but Escherichia and Chlorobium rRNAs no longer form a monophyletic group, instead appearing as adjacent branches. Pairwise SAB values for these bacterial catalogs are in the range 0.19–0.34. Recalling that Carl35 called attention to the imprecise relationship between SAB and full-length sequence similarity especially at SAB < 0.40, we selected a different bacterial data set (from Fig. 4 of ref. 40) with pairwise SAB values in the range 0.31–0.78, and again calculated $D_2^S$ values and distances. The topology of this tree (Fig. 2C) agrees with the alignment-based references (Fig. 2A and B) and differs from that implied by Fox et al.40 only in the relative branching positions of the most-basal branches; that is, again the differences correspond with the smallest SAB values, and short internal edges40.

3.1.3.6 K-mer trees from full-length sequences

We extracted k-mers from full-length sequences at selected values of k between 6 and 16, calculated pairwise $D_2^S$ values and distances, and used these to compute NJ trees for the three-kingdom74 and bacterial data sets40. The three-kingdom structure, and branching order within Archaea, do not depend on choice of k within this range; branching order within bacteria, and within eukaryotes, does (Fig. 1E–H). The expected cyanobacterium–chloroplast and Bacillus–Corynebacterium pairs are apparent at k = 6, 8, 12, and 16, while the other two bacterial sequences, Escherichia coli and Chlorobium vibrioforme, show no consistent position. This is perhaps unsurprising, as even today basal branching in the bacterial tree can scarcely be resolved80. As above, we therefore examined a less-divergent bacterial data set (from Fig. 4 of ref. 40). At k = 6, 8, or 12 (Fig. 2D–F) our $D_2^S$-based NJ trees agree with the alignment-based reference (Fig. 2A and B). Even at k = 16 (Fig. 2G), much of the expected internal structure is preserved.

3.1.4 Discussion

From about 1971 through the mid-1980s, Carl Woese and colleagues generated T1 oligonucleotide catalogs for more than 400 organisms, mostly bacteria and archaea, with the aim of understanding the nexus among structure, function, and evolution for the RNA components of the translational apparatus. Using tools that in retrospect seem basic—nuclease digestion, radiolabelling, paper electrophoresis, binary association coefficients, clustering algorithms, and simple statistical models of expected similarity—Carl and his colleagues revolutionized the way we view the living world. Recognition of the three kingdoms of life, a phylogenetic backbone of the microbial world, and natural groupings of various size, taxonomic depth, and biological specialization all arose from Carl’s interpretation of the molecular fossil record internal to 16S/18S rRNA, via the deeply structural idea of the molecular chronometer that intertwines structure, sequence, and evolution for sufficiently large rRNA molecules35,36.

For this fundamental biology to have emerged and withstood the test of time, T1 oligonucleotide catalogs (incomplete sets of unordered, short, and somewhat noisy sequences) must carry phylogenetic signal. To be sure, their power of resolution wears thin at greater depths (low SAB values), but this is true as well for complete sequences using modern methods80.

Empirical oligonucleotide catalogs sample surprisingly little of the full-length sequence (Table 3), although rather more of its information content (see above). Carl, who was using these catalogs (along with other approaches) to reconstruct full-length sequences, was well aware of this, but argued that oligonucleotides of length ≤ 4, which accounted for much of the sequence not represented in the catalogs, were in any case uninformative about homology81; length 5 was “marginal.” The same argument had earlier been made for short oligopeptides in tryptic digests (e.g., ref. 22). By contrast, k-mers represent the entire sequence, base-paired, and uninformative regions along with informative ones.

Three kingdoms are apparent in all the trees we compute from e-catalogs or k-mers, as is the 2M+2M arrangement within archaea (methanogens). By contrast, within bacteria the branching order is somewhat unstable, particularly for the more-basal branches. Interestingly, the same features are poorly resolved in a modern curated resource, with structure-guided multiple alignment of full-length sequences, and RAxML inference of trees80. As for the eukaryotic subtree, the inability of 18S rRNA sequence analysis to resolve the branching order of the green plant, fungal, and animal lineages is well known82.

It has not been our aim here to illustrate the full spectrum of so-called alignment-free approaches and methods, nor to compute k-mer trees for other genes, proteins, concatenated gene sets, or full genomes. We hope that these analyses will stimulate reflection and deeper analysis where warranted, on how and why catalog-based methods could underpin the revolutionary era in microbiology associated with Carl Woese. Thanks to next-generation and community sequencing technologies, microbiology again faces large, imperfect, and not entirely familiar data; new analytical, comparative, and computational approaches are in play, while non-evolutionary directions beckon. Carl understood that only an evolutionary framework could link genotype with phenotype, and molecular structure with function. With the support of colleagues and great personal determination, Carl built that framework. His life and achievements are, and will long remain, an inspiration.

3.1.5 Materials and Methods

Data. All 16S/18S rRNA sequences used in this study are listed in Table 1. We obtained full-length rRNA sequences from organisms and strains that are identical, or as closely related as possible, to those examined by Woese and Fox74 (Table S1). For closer examination of the bacterial lineage, we selected from the organisms in Figure 4 of reference 40 in a way that gives representation of the four major proteobacterial groupings: α (3), β (1), δ (1), and γ (3), with the 16S rRNA of the cyanobacterium Synechocystis sp. as outgroup (Table S2).

Generation of e-catalogs. To mimic T1 RNase digestion, we computationally cleaved each full-length sequence (Table 1) immediately 3′ of each guanine (G) residue, yielding a set of strings that end in G. Terminal fragments were included in each set, while strings of length < 6 were removed. For ease of handling, we ordered each list first by increasing size, then alphabetically. Our e-catalogs of the Woese and Fox74 three-kingdom organism set are presented in Data S1, and those of the bacterial set40 in Data S2.
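The digestion procedure just described can be sketched in Python as follows. This is a minimal reading of the text rather than the script actually used; the function names `t1_digest` and `e_catalog` are hypothetical, and repeated fragments are kept rather than collapsed, which is an assumption.

```python
def t1_digest(sequence: str) -> list:
    """Cleave immediately 3' of every G, so each non-terminal fragment ends in G."""
    seq = sequence.upper()
    fragments = []
    start = 0
    for i, base in enumerate(seq):
        if base == "G":
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # The 3'-terminal fragment is kept even though it need not end in G.
        fragments.append(seq[start:])
    return fragments


def e_catalog(sequence: str, min_length: int = 6) -> list:
    """Drop fragments shorter than min_length, then sort by size and alphabetically."""
    kept = [f for f in t1_digest(sequence) if len(f) >= min_length]
    return sorted(kept, key=lambda f: (len(f), f))
```

For example, a sequence built from the fragments AAUCCG, AG, CCUUAAAG, UUACCCAUCG, G, and AUUUCACU yields an e-catalog of four entries, because AG and G fall below the length threshold.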

Phylogenetic analysis. For each of the two sets of full-length rRNA sequences, we performed multiple sequence alignment using MUSCLE83 at default settings, then inferred trees using MRBAYES84 and RAxML85. MRBAYES parameter settings were: MCMC ngen = 5 000 000 generations, nchain = 4, burnin = 2 500 000 generations. RAxML (-m GTRGAMMA) was run with 100 bootstraps. For the k-mer-based approach, for each sequence set we applied $D_2^S$ statistics77 independently at k = 6, 8, 12, and 16, yielding a score for each possible pair of sequences within each set. These scores were transformed via logarithmic representation of the geometric mean, to generate a distance. The pairwise distance $d_{ab}$ between sequences a and b is defined as

$$d_{ab} = \ln\left(\frac{D_{ab}}{\sqrt{D_{aa} \times D_{bb}}}\right)$$

where $D_{ab}$ is the pairwise score, and $D_{aa}$ and $D_{bb}$ are the respective self-matching scores (see Appendix A). The resulting distance matrix generated for each k was used to reconstruct a phylogenetic tree using neighbor in PHYLIP v3.69. Similarly, using the in silico oligomer catalog for each full-length sequence as input for $D_2^S$, we calculated pairwise scores and distances for all pairs within a sequence set. The resulting distance matrix was used to compute a tree using neighbor in PHYLIP v3.69.
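The score-to-distance transformation can be illustrated with a short Python sketch. Two assumptions are made and should be read as such: the plain D2 score (the sum over shared k-mers of the product of their counts) stands in for the $D_2^S$ statistic, whose definition is given in Appendix A and is not reproduced here; and the absolute value of the logarithm is taken so that the result is non-negative. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(counts_a: Counter, counts_b: Counter) -> int:
    """Plain D2 (a stand-in for D2S): sum of count products over shared k-mers."""
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


def score_to_distance(d_ab: float, d_aa: float, d_bb: float) -> float:
    """Log of the pairwise score over the geometric mean of the two self-scores.

    ASSUMPTION: the absolute value is taken so that distances are non-negative.
    """
    return abs(math.log(d_ab / math.sqrt(d_aa * d_bb)))


def distance_matrix(seqs: dict, k: int) -> dict:
    """Return {(a, b): distance} for every ordered pair of sequence names."""
    counts = {name: kmer_counts(s, k) for name, s in seqs.items()}
    self_score = {name: d2_score(c, c) for name, c in counts.items()}
    dist = {(name, name): 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        d_ab = d2_score(counts[a], counts[b])
        if d_ab <= 0:
            raise ValueError(f"{a} and {b} share no k-mers at k = {k}")
        d = score_to_distance(d_ab, self_score[a], self_score[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist
```

Dividing by the geometric mean of the self-matching scores normalises for sequence length and composition, so that two identical sequences receive a distance of zero. A matrix of this kind could then be written out in a format readable by a neighbor-joining program.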

3.1.6 References

1 Pechman KJ, Woese CR. Characterization of the primary structural homology between the 16s ribosomal RNAs of Escherichia coli and Bacillus megaterium by oligomer cataloging. J Mol Evol, 1:230-40 (1972).

2 Keller EB, Zamecnik PC, Loftfield RB. The role of microsomes in the incorporation of amino acids into proteins. J Histochem Cytochem, 2:378-86 (1954).

3 Roberts RB. Introduction. In: Roberts RB, ed. Microsomal particles and protein synthesis. New York: Pergamon Press, 1958: vii-viii (1958).

4 Bąkowska-Żywicka K, Tyczewska A. The structure of the ribosome – short history. Biotechnologia, 1:14-23 (2009).

5 Woese CR, Forro JR. Correlations between ribonucleic acid and deoxyribonucleic acid metabolism during spore germination. J Bacteriol, 80:811-7 (1960).

6 Woese CR, Langridge R, Morowitz HJ. Microsome distribution during germination of bacterial spores. J Bacteriol, 79:777-82 (1960).

7 Woese CR. Universality in the genetic code. Science, 144:1030-1 (1964).

8 Woese CR. Order in the genetic code. Proc Natl Acad Sci U S A, 54:71-5 (1965).

9 Woese CR. On the evolution of the genetic code. Proc Natl Acad Sci U S A, 54:1546-52 (1965).

10 Woese CR, Dugre DH, Saxinger WC, Dugre SA. The molecular basis for the genetic code. Proc Natl Acad Sci U S A, 55:966-74 (1966).

11 Stretton AOW. The first sequence. Fred Sanger and insulin. Genetics, 162:527-32 (2002).

12 Markham R, Smith JD. The structure of ribonucleic acids. II. The smaller products of ribonuclease digestion. Biochem J, 52:558-65 (1952).

13 Sanger F. Sequences, sequences, and sequences. Annu Rev Biochem, 57:1-28 (1988).

14 Reddi KK. Structural differences in the nucleic acids of some tobacco mosaic virus strains. II. Di- and tri-nucleotides in ribonuclease digests. Biochim Biophys Acta, 32:386-92 (1959).

15 Sanger F, Brownlee GG, Barrell BG. A two-dimensional fractionation procedure for radioactive nucleotides. J Mol Biol, 13:373-98 (1965).

16 Uchida T, Bonen L, Schaup HW, Lewis BJ, Zablen L, Woese C. The use of ribonuclease U2 in RNA sequence determination. Some corrections in the catalog of oligomers produced by ribonuclease T1 digestion of Escherichia coli 16S ribosomal RNA. J Mol Evol, 3:63-77 (1974).

17 Sapp J, Fox GE. The singular quest for a universal tree of life. Microbiol Mol Biol Rev, 77:541-50 (2013).

18 Holley RW, Apgar J, Everett GA, Madison JT, Marquisee M, Merrill SH, Penswick JR, Zamir A. Structure of a ribonucleic acid. Science, 147:1462-5 (1965).

19 Brownlee GG, Sanger F, Barrell BG. Nucleotide sequence of 5S-ribosomal RNA from Escherichia coli. Nature, 215:735-6 (1967).

20 Ingram VM. Gene evolution and the haemoglobins. Nature, 189:704-8 (1961).

21 Hill RL, Buettner-Janusch J, Buettner-Janusch V. Evolution of haemoglobin in primates. Proc Natl Acad Sci U S A, 50:885-93 (1963).

22 Zuckerkandl E, Jones RT, Pauling L. A comparison of animal hemoglobins by tryptic peptide pattern analysis. Proc Natl Acad Sci U S A, 46:1349-60 (1960).

23 Zuckerkandl E, Pauling L. Evolutionary divergence and convergence in proteins. In: Bryson V, Vogel HJ, eds. Evolving Genes and Proteins. New York and London: Academic Press, 1965:97-166 (1965).

24 Fitch WM, Margoliash E. Construction of phylogenetic trees. Science, 155:279-84 (1967).

25 Fitch WM, Margoliash E. The usefulness of amino acid and nucleotide sequences in evolutionary studies. Evol Biol, 4:67-109 (1970).

26 Dickerson RE. The structures of cytochrome c and the rates of molecular evolution. J Mol Evol, 1:26-45 (1971).

27 Zuckerkandl E. On the molecular evolutionary clock. J Mol Evol, 26:34-46 (1987).

28 Woese CR. (Department of Microbiology, University of Illinois). Letter to: Francis H.C. Crick. (1969).

29 Woese CR. Composition of various ribonucleic acid fractions from micro-organisms of different deoxyribonucleic acid composition. Nature, 189:920-1 (1961).

30 Sogin SJ, Sogin ML, Woese CR. Phylogenetic measurement in procaryotes by primary structural characterization. J Mol Evol, 1:173-84 (1972).

31 Woese CR. Primary structure homology within the 23S ribosomal RNA. Nature, 220:923 (1968).

32 Woese CR, Stackebrandt E, Ludwig W. What are mycoplasmas: the relationship of tempo and mode in bacterial evolution. J Mol Evol, 21:305-16 (1985).

33 McGill TJ, Jurka J, Sobieski JM, Pickett MH, Woese CR, Fox GE. Characteristic archaebacterial 16S rRNA oligonucleotides. Syst Appl Microbiol, 7:194-7 (1986).

34 Woese CR. (Department of Microbiology, University of Illinois). Lecture at Nobel Symposium 70 (Karlskoga, Sweden) from notes of Mark A. Ragan (then Atlantic Research Laboratory, National Research Council Canada). (1988).

35 Woese CR. Bacterial evolution. Microbiol Rev, 51:221-71 (1987).

36 Woese CR, Stackebrandt E, Macke TJ, Fox GE. A phylogenetic definition of the major eubacterial taxa. Syst Appl Microbiol, 6:143-51 (1985).

37 Fox GE, Magrum LJ, Balch WE, Wolfe RS, Woese CR. Classification of methanogenic bacteria by 16S ribosomal RNA characterization. Proc Natl Acad Sci U S A, 74:4537-41 (1977).

38 Rushizky GW, Knight CA. An oligonucleotide mapping procedure and its use in the study of tobacco mosaic virus nucleic acid. Virology, 11:236-49 (1960).

39 Zablen LB, Kissil MS, Woese CR, Buetow DE. Phylogenetic origin of the chloroplast and prokaryotic nature of its ribosomal RNA. Proc Natl Acad Sci U S A, 72:2418-22 (1975).

40 Fox GE, Stackebrandt E, Hespell RB, Gibson J, Maniloff J, Dyer TA, Wolfe RS, Balch WE, Tanner RS, Magrum LJ, et al. The phylogeny of prokaryotes. Science, 209:457-63 (1980).

41 Sobieski JM, Chen KN, Filiatreau JC, Pickett MH, Fox GE. 16S rRNA oligonucleotide catalog data base. Nucleic Acids Res, 12:141-8 (1984).

42 Fox SW, Homeyer PG. A statistical evaluation of the kinship of protein molecules. Am Nat, 89:163-8 (1955).

43 Smith EL, Greene RD. Further studies on the amino acid composition of seed globulins. J Biol Chem, 167:833-42 (1947).

44 Haurowitz F. Chemistry and biology of proteins. New York: Academic Press, (1950).

45 Ingram VM. A specific chemical difference between the globins of normal human and sickle-cell anaemia haemoglobin. Nature, 178:792-4 (1956).

46 Ingram VM. Abnormal human haemoglobins. I. The comparison of normal human and sickle-cell haemoglobins by fingerprinting. Biochim Biophys Acta, 28:539-45 (1958).

47 Šorm F. On proteins. XXV. The chemical structure of proteins. Introduction. [in Russian]. Collect Czech Chem Commun, 19:1003-5 (1954).

48 Šorm F, Keil B, Holeyšovský V, Knesslová V, Kostka V, Mäsiar P, Meloun B, Mikeš O, Tomášek V, Vaněček J. On proteins. XXXIX. Structural resemblance in certain proteins. Collect Czech Chem Commun, 22:1310-29 (1957).

49 Williams J, Clegg JB, Mutch MO. Coincidence and protein structure. J Mol Biol, 3:532-40 (1961).

50 DuBuy B, Weissman SM. Nucleotide sequence of Pseudomonas fluorescens 5 S ribonucleic acid. J Biol Chem, 246:747-61 (1971).

51 Anderberg MR. Cluster analysis for applications. New York: Academic Press, (1973).

52 Fox GE, Pechman KR, Woese CR. Comparative cataloging of 16S ribosomal ribonucleic acid: molecular approach to procaryotic systematics. Int J Syst Bacteriol, 27:44-57 (1977).

53 Bonen L, Doolittle WF. Partial sequences of 16S rRNA and the phylogeny of blue-green algae and chloroplasts. Nature, 261:669-73 (1976).

54 Balch WE, Magrum LJ, Fox GE, Wolfe RS, Woese CR. An ancient divergence among the bacteria. J Mol Evol, 9:305-11 (1977).

55 Woese CR, Maniloff J, Zablen LB. Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci U S A, 77:494-8 (1980).

56 Woese CR, Stackebrandt E, Weisburg WG, Paster BJ, Madigan MT, Fowler VJ, Hahn CM, Blanz P, Gupta R, Nealson KH, et al. The phylogeny of purple bacteria: the alpha subdivision. Syst Appl Microbiol, 5:315-26 (1984).

57 Woese CR, Weisburg WG, Paster BJ, Hahn CM, Tanner RS, Krieg NR, Koops H-P, Harms H, Stackebrandt E. The phylogeny of purple bacteria: the beta subdivision. Syst Appl Microbiol, 5:327-36 (1984).

58 Ehresmann C, Stiegler P, Mackie GA, Zimmermann RA, Ebel JP, Fellner P. Primary sequence of the 16S ribosomal RNA of Escherichia coli. Nucleic Acids Res, 2:265-78 (1975).

59 Brosius J, Palmer ML, Kennedy PJ, Noller HF. Complete nucleotide sequence of a 16S ribosomal RNA gene from Escherichia coli. Proc Natl Acad Sci U S A, 75:4801-5 (1978).

60 Lanni F. Analysis of sequence patterns in ribonuclease, I. Sequence vectors and vector maps. Proc Natl Acad Sci U S A, 46:1563-76 (1960).

61 Zachau HG, Dütting D, Feldmann H. The structures of two serine transfer ribonucleic acids. Hoppe Seylers Z Physiol Chem, 347:212-35 (1966).

62 Gibbs AJ, Dale MB, Kinns MR, MacKenzie HG. The transition matrix method for comparing sequences, its use in describing and classifying proteins by their amino acid sequences. Syst Zool, 20:417-25 (1971).

63 Fitch WM. An improved method of testing for evolutionary homology. J Mol Biol, 16:9-16 (1966).

64 Gibbs AJ, McIntyre GA. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur J Biochem, 16:1-11 (1970).

65 Fitch WM. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem Genet, 3:99-108 (1969).

66 Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48:443-53 (1970).

67 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol, 215:403-10 (1990).

68 Vinga S, Almeida J. Alignment-free sequence comparison-a review. Bioinformatics, 19:513-23 (2003).

69 Chan CX, Ragan MA. Next-generation phylogenomics. Biol Direct, 8:3 (2013).

70 Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A, 83:5155-9 (1986).

71 Höhl M, Ragan MA. Is multiple-sequence alignment required for accurate inference of phylogeny? Syst Biol, 56:206-21 (2007).

72 Ragan MA, Chan CX. Biological intuition in alignment-free methods: response to Posada. J Mol Evol, 77:1-2 (2013).

73 Forêt S, Kantorovitz MR, Burden CJ. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformatics, 7(Suppl 5):S21 (2006).

74 Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A, 74:5088-90 (1977).

75 Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol, 10:R108 (2009).

76 Forêt S, Wilson SR, Burden CJ. Empirical distribution of k-word matches in biological sequences. Pattern Recognit, 42:539-48 (2009).

77 Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (I): statistics and power. J Comput Biol, 16:1615-34 (2009).

78 Wan L, Reinert G, Sun F, Waterman MS. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol, 17:1467-90 (2010).

79 Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4:406-25 (1987).

80 Yarza P, Richter M, Peplies J, Euzeby J, Amann R, Schleifer K-H, Ludwig W, Glöckner FO, Rosselló-Móra R. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst Appl Microbiol, 31:241-50 (2008).

81 Woese CR, Fox GE, Zablen L, Uchida T, Bonen L, Pechman K, Lewis BJ, Stahl D. Conservation of primary structure in 16S ribosomal RNA. Nature, 254:83-6 (1975).

82 Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science, 290:972-7 (2000).

83 Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32:1792-7 (2004).

84 Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol, 61:539-42 (2012).

85 Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22:2688-90 (2006).

86 Pace NR, Sapp J, Goldenfeld N. Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life. Proc Natl Acad Sci U S A, 109:1011-8 (2012).

87 Nair P. Woese and Fox: Life, rearranged. Proc Natl Acad Sci U S A, 109:1019-21 (2012).

88 Magrum L, Zablen L, Stahl D, Woese C. Corrections in the catalogue of oligonucleotides produced by digestion of Escherichia coli 16S rRNA with T1 RNase. Nature, 257:423-6 (1975).

3.2 Inferring phylogenies of evolving sequences without multiple sequence alignment

3.2.1 Abstract

Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.

3.2.2 Introduction

Multiple sequence alignment (MSA) has long been a standard stage in phylogenetic workflows1,2. In this approach, homologous sequences are first multiply aligned along their full length, yielding positional hypotheses of homology (alignment columns) that are input to maximum parsimony, maximum likelihood (ML) or Bayesian inference, or summarised in a distance matrix and used to compute a tree e.g. by neighbour-joining (NJ). A key assumption of MSA is that in each such set of sequences, homologous positions occur in the same order relative to one another. This is not fully realistic, as genes and genomes are subject to recombination, rearrangement and lateral genetic transfer3-5. In sequences so affected, the positional hypothesis of homology generated by MSA will be incomplete or incorrect, diffusing the phylogenetic signal, violating models of the substitution process across sites and branches, and consequently misleading phylogenetic inference6,7. These issues can only be intensified by the on-going deluge of sequencing data arising from advances in sequencing technologies8.

An alternative to MSA in phylogenetic inference is the so-called alignment-free approach in which pairwise similarity is computed from sub-sequences, e.g. counts of exact (or inexact) sub-sequences of defined length, or by extension, of conserved sequence patterns9,10, or alternatively of match lengths11. These sub-sequences are known variously as words, k-mers or n-grams12; see refs 13-15 for recent reviews. A word-count approach for alignment-free sequence comparison uses the D2 statistic15-18. A D2 score is calculated based on the exact count of shared k-mers between any two sequences, thus representing the extent of similarity they share (see Supplementary Note for details). Since the profile of k-mers depends on the length of the sequence, modifications have been proposed to accommodate this bias, e.g. normalising the D2 score by the probability of occurrence for each k-mer observed in the sequences (D2S), or by the mean and variance of k-mer occurrences (D2*)17,18. These studies have demonstrated that D2S and D2* have greater statistical power than D2, and that this power increases with sequence length15,17,18. These statistics can be easily transformed into a pairwise measure of dissimilarity or distance, which can then be used to compute phylogenetic relationships.
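As an illustration of the word-count idea described above, the following minimal Python sketch computes a D2 score as the sum, over all k-mers shared by two sequences, of the product of their counts. The function names (kmer_counts, d2_score) and the toy sequences are hypothetical and are not taken from the software used in this study; the normalised variants D2S and D2* are not shown.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer (word of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(seq_x, seq_y, k):
    """D2: sum over shared k-mers of (count in X) * (count in Y)."""
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    return sum(n * counts_y[word] for word, n in counts_x.items() if word in counts_y)


# Toy example (hypothetical sequences): only the 4-mer ACGT is shared.
x = "ACGTACGT"
y = "ACGTTTTT"
print(d2_score(x, y, 4))  # ACGT occurs twice in x and once in y
print(d2_score(x, x, 4))  # self-comparison, used later for normalisation
```

Because the score grows with the number of k-mers available, longer sequences tend to score higher against everything, which is the length bias that the normalised statistics are meant to address.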

Alignment-free approaches have been adopted in searches of sequence databases19, clustering of expressed sequence tags20, and more recently in detecting lateral genetic transfer11. By directly computing pairwise dissimilarity or distance using these methods, one can bypass resource-intensive ML or Bayesian approaches in favour of NJ. Some methods implementing approximate ML measures21,22, although less accurate, are less resource-intensive. However, the sensitivity of alignment-free methods to different evolutionary scenarios, and the scalability of these methods, have not been systematically investigated.

Here, using both simulated and empirical data we assess the accuracy of alignment-free phylogenetic approaches using D2 statistics compared to standard MSA-based approaches. Using sets of simulated nucleotide and amino acid sequences, we systematically examine the accuracy and sensitivity of D2 methods to key molecular evolutionary processes including sequence divergence, among-site rate heterogeneity, biases of G+C content, genetic rearrangements and insertions/deletions, as well as to the technical issue of incomplete sequence data. We demonstrate the scalability and potential of using alignment-free approaches to compute phylogenetic trees quickly and accurately from large-scale DNA or protein data.

3.2.3 Results

For our alignment-free phylogenetic approach, we used D2 statistics (independently for D2, D2S and D2*)17,18 to generate a score for each possible pair of sequences within a set. Here we also introduce D2n, a D2 statistic that extends each k-mer recovered in the sequences to its neighbourhood n, i.e. allows n wildcard residue(s). This simple extension of D2 is analogous to generation of high-scoring words for the query phase of BLAST23, and to a published alignment-free measure of sequence similarity24; a measure of inexact match has recently been extended to a position-specific context25. We denote cases of D2n where n = 1 as D2n=1 hereinafter. Each of these metrics is described in the Supplementary Note. For each method, we transform the scores via logarithmic representation of the geometric mean to estimate evolutionary distances (see Methods). Each resulting distance matrix was then used to calculate phylogenetic relationships using NJ. For comparison, for each sequence set we performed MSA using the popular tool, MUSCLE26 and inferred a phylogenetic tree using the widely used MrBayes27. We use Robinson-Foulds distances28 to evaluate topological congruence between each of the resulting test trees and a reference tree, normalised to adjust for different tree sizes (see Methods for details). We denote RF as the normalised Robinson-Foulds distance. RF = 0 indicates that the test tree shows complete topological congruence with the reference, while RF = 1 indicates that the test tree has no bipartition in common with the reference. The RF for a test tree generated via one of the four D2 methods is denoted as RFD2, RFD2S, RFD2* or RFD2n1, and the equivalent for a test tree generated via MSA and MrBayes is denoted as RFMSA.
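The score-to-distance step mentioned above can be sketched as follows. This is a minimal illustration that assumes the transformation takes the form d(X,Y) = |ln(S_XY / sqrt(S_XX × S_YY))|, where S denotes a D2 score; the exact formula is given in the Methods, which are not reproduced in this section, so the form used here should be read as an assumption. The helper names are hypothetical.

```python
import math
from collections import Counter


def d2_score(seq_x, seq_y, k):
    """Plain D2 score: sum over shared k-mers of the product of their counts."""
    cx = Counter(seq_x[i:i + k] for i in range(len(seq_x) - k + 1))
    cy = Counter(seq_y[i:i + k] for i in range(len(seq_y) - k + 1))
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


def d2_distance(seq_x, seq_y, k):
    """Assumed transform: |ln(S_xy / geometric mean of the two self-scores)|."""
    s_xy = d2_score(seq_x, seq_y, k)
    if s_xy == 0:
        return math.inf  # no shared k-mer: distance undefined, treated as infinite here
    geo_mean = math.sqrt(d2_score(seq_x, seq_x, k) * d2_score(seq_y, seq_y, k))
    return abs(math.log(s_xy / geo_mean))


def distance_matrix(seqs, k):
    """Symmetric pairwise matrix, suitable as input to a neighbour-joining program."""
    n = len(seqs)
    return [[d2_distance(seqs[i], seqs[j], k) for j in range(n)] for i in range(n)]


matrix = distance_matrix(["ACGTACGT", "ACGTTTTT", "ACGTACGA"], 4)
for row in matrix:
    print(["%.3f" % d for d in row])
```

Under this assumed form, a sequence compared with itself has distance zero and the matrix is symmetric, which are the properties a distance-based method such as NJ expects.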

3.2.3.1 Simulated data

Using simulated data, we independently assess the sensitivity of D2 methods to variation in key evolutionary processes: sequence divergence, genetic rearrangement, and insertions/deletions. Because the phylogenetic tree is known for each simulated sequence set, we use that as the reference.

Sequence divergence. We simulated nucleotide sequence sets of various size categories N = 8, 32 and 128 (total length, L = 1500 nt). For each category, six sequence sets were simulated under an unrooted tree topology across distinct situations of relative branch lengths, with α = 1 in an 8-category discrete gamma distribution. Each of these trees (T1 through T6 in Fig. 1; shown for 8-taxon trees) represents a fine-scale scenario of sequence divergence, as determined by different combinations of internal (x) and terminal (y) branch lengths. In some simulations, we recognise two subsets of y (y1 and y2) of different length. Sets containing varied divergence levels had different combinations of x, y1 and y2 as shown in T2, T3, T5 and T6; these are the reference trees for the corresponding sequence sets. For 32- and 128-taxon trees, the topologies were simply expanded stepwise for each upper and lower half, as indicated in Fig. 1 (i.e. p1 and p2 represent individual expandable modules). For instance in a 128-taxon tree, the relative lengths (x, y1, y2) of the first 64 taxa follow pattern p1, while the others follow p2. For simplicity, x and y (or y1 and y2) were set at either 0.01 or 0.05 (unit in number of substitutions per site). The least-divergent (most-similar) sequence set (T1) was simulated with all branch lengths x = y1 = y2 = 0.01 (two most dissimilar sequences differ at 0.14 substitutions per site at N = 128), whereas the most-divergent (most-dissimilar) set (T4) had x = y1 = y2 = 0.05 (two most dissimilar sequences differ at 0.70 substitutions per site at N = 128). The branch lengths in all these trees are short (two most dissimilar sequences in any set differ at ≤0.70 substitutions per site), so any MSA-based approach should have no problem recovering these phylogenies. However, these datasets provide a testable range of sequence divergence to assess the sensitivity of alignment-free methods in recovering the topologies. For each sequence set, we independently derived pairwise distances using D2, D2S, D2* and D2n=1, in each case across different k-mer lengths (k = 4, 8, 12, 16, 20 and 24). Each parameter setting was run with 100 replicates, i.e. 100 × 3 size categories × 6 trees × 4 methods of D2 statistics × 6 k-mer lengths (total of 43200 sequence sets). The same experimental design applies to protein sequences with fixed sequence length of 500 amino acids. See Methods for details.
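The size of the experimental design stated above can be checked by enumerating the parameter grid. The short sketch below is only a bookkeeping aid; the variable names and method labels are hypothetical and do not correspond to any script used in this study.

```python
from itertools import product

# Factors of the simulation design as described in the text.
sizes = [8, 32, 128]                          # N, number of sequences per set
trees = ["T1", "T2", "T3", "T4", "T5", "T6"]  # branch-length scenarios (Fig. 1)
methods = ["D2", "D2S", "D2*", "D2n=1"]       # the four D2 statistics
kmer_lengths = [4, 8, 12, 16, 20, 24]         # k
replicates = 100

# One parameter setting per combination of the four factors.
settings = list(product(sizes, trees, methods, kmer_lengths))
total_runs = len(settings) * replicates

print(len(settings))  # 432 distinct parameter settings
print(total_runs)     # 43200, matching the total quoted in the text
```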

To compare the performance between MSA-based and the D2 methods, we define a relative measure of accuracy QDX = RFMSA - RFDX, where DX represents any of the D2 methods, i.e. QD2 is the Q that corresponds to RFMSA - RFD2, and so forth. Derived from RF, the Q values reflect the proportion of bipartitions in a tree, and can be interpreted as the difference between the deviation of each tree from the common reference. The sign of the Q value indicates which of the two approaches performs better; if a D2 method performs better than MSA in recovering the reference tree then Q > 0 (i.e. RFMSA > RFDX), whereas if a D2 method performs worse than MSA then Q < 0 (i.e. RFMSA < RFDX). Where Q = 0 (i.e. RFMSA = RFDX) the D2 method performs as well as the MSA-based approach, although the trees could still be incongruent with the reference (i.e. their RF could be non-zero).
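The two measures defined above can be sketched as follows, assuming that each tree is summarised by its set of non-trivial bipartitions and that the normalised RF is the number of bipartitions found in only one of the two trees divided by the total number of bipartitions in both trees. This normalisation is an assumption made for illustration (the normalisation actually used is described in the Methods), and the toy trees and function names are hypothetical.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side, taxa = frozenset(side), frozenset(taxa)
    return taxa - side if min(taxa) in side else side


def normalised_rf(splits_a, splits_b):
    """Assumed normalisation: unshared bipartitions / all bipartitions in both trees."""
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0


def q_value(rf_msa, rf_dx):
    """Q = RF_MSA - RF_DX; Q > 0 means the D2 method is closer to the reference."""
    return rf_msa - rf_dx


taxa = "ABCDEF"


def tree(*sides):
    return {canonical_split(s, taxa) for s in sides}


reference = tree("AB", "CD", "EF")  # toy 6-taxon reference topology
msa_tree = tree("AB", "CD", "EF")   # identical to the reference
d2_tree = tree("AB", "CE", "DF")    # shares only the AB bipartition

rf_msa = normalised_rf(msa_tree, reference)
rf_d2 = normalised_rf(d2_tree, reference)
print(rf_msa, round(rf_d2, 3), round(q_value(rf_msa, rf_d2), 3))
```

In this toy case the MSA tree matches the reference (RF = 0) while the D2 tree shares one of three bipartitions, so Q is negative, i.e. the D2 tree is the less accurate of the two.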

Across all D2 methods used in this study, we found that D2n=1 yielded the smallest RF across all categories of size and situations of relative branch length, for both nucleotide (Supplementary Fig. S1) and protein (Supplementary Fig. S2) sequence sets. Figure 2a shows mean RFD2n1 at different k-mer lengths (shown for k ≥ 8) in each size category N of nucleotide sequence sets, across all trees (T1 through T6; Fig. 1), with the corresponding mean Q value shown in Fig. 2b. Across all N, D2n=1 recovered the reference topology almost perfectly for sets of sequences simulated under trees T1, T2, T4 and T6 (at k = 16, mean RFD2n1 ≤ 0.001 across these sets and all N; Fig. 2a), whereas larger RFD2n1 distances are observed for cases of T3 and T5 (e.g. for N = 32 at k = 16, mean RFD2n1 = 0.06 and 0.03 respectively for T3 and T5; Fig. 2a). The accuracy decreased with increasing k, e.g. for N = 128 and T3, mean RFD2n1 = 0.01, 0.03, 0.06, 0.11, 0.18 at k = 8, 12, 16, 20 and 24.

Figure 1. Trees for simulation of sequence data. Six situations showing distinct combinations of internal (x) and terminal (y) branches, labelled as T1 through T6, with y specified differently between the first (p1) and second (p2) half of the branches on a tree. The unit of branch lengths is number of substitutions per site. The length of each edge is either 0.01 or 0.05 substitutions per site.

Figure 2. The accuracy of D2 methods based on sequence divergence of the nucleotide sequence sets. For each size N at (i) 8, (ii) 32 and (iii) 128, mean RFD2n1 are shown in (a) across different k-mer lengths (shown for k = 8, 12, 16, 20, 24), for cases simulated under each of the six trees (T1 through T6 on the x-axis). The corresponding QD2n1 for each case is shown in (b). Error bars indicate standard deviation from the mean. See Figures S1 through S4 for complete results for all D2 methods for both nucleotide and protein sequence sets.

While relative performance differed across the simulated scenarios, overall across these sequence sets we find that D2n=1 performed as well as the standard MSA-based approach (e.g. for T1 and T2 at k = 8, mean QD2n1 = 0.00 in all cases of N = 8, 32 and 128; Fig. 2b), with the relative performance Q decreasing slightly with increased k (e.g. for N = 32 at T3, Q = -0.01, -0.03, -0.06, -0.11, -0.17 at k = 8, 12, 16, 20 and 24). Across all N examined here, D2n=1 performed slightly worse than MSA for T3 and T5, e.g. at k = 8, QD2n1 = -0.01 and -0.02 respectively at N = 32; QD2n1 = -0.02 and -0.07 respectively at N = 128. The bar plots in Fig. 2a almost mirror those in Fig. 2b, suggesting that RFMSA = 0 in most cases. Both T3 and T5, the cases problematic for D2 methods, have short internal branches (x) with long terminal branches (y; Fig. 1). Our results suggest that D2 methods are more vulnerable to this situation, while the MSA-based approach performed well across these six cases. Q values observed for other D2 methods across nucleotide and protein sequence sets are shown in Supplementary Fig. S3 and S4 respectively.

To assess the optimal k-mer length for use in D2 methods in deducing phylogenetic relationships from nucleotide and protein sequences, we compared RF values from all D2 methods between the two sequence types across N = 8, 32 and 128 pooled from all six trees, as shown in Supplementary Fig. S5. For nucleotide sequences, k = 8 yielded the lowest RF distances, with RF = 0 at N = 8 and 32, and RF < 0.002 at N = 128 across all D2 methods. For protein sequences, k = 4 is the optimal length across all D2 methods, with D2n=1 yielding the smallest RF distances across all size categories, i.e. RFD2n1 = 0.012, 0.009 and 0.009 at N = 8, 32 and 128. This result supports the notion that optimal k is negatively correlated with alphabet size of the sequence data9,29,30. For equiprobable residues the match probability of two nucleotides, p is the inverse of the alphabet

size a; therefore an expression of the form k ∝ log(L) / log(1/p), in which L is the length of the genome, might give us the desired relationship. For other residue-probability distributions formal proof appears to be lacking, but might be approached analogously to an earlier study31.

Two other scenarios relevant to sequence divergence are among-site rate heterogeneity (the presence of fast- versus slow-evolving sequence regions), and compositional (G+C content) biases in the sequences. We examined the sensitivity of D2 methods independently to each of these scenarios (see Supplementary Note for detail). Overall, among-site rate variation does not appear to affect drastically the accuracy of either D2 or MSA-based approaches (Q = 0 in most cases at optimal k in Supplementary Fig. S6); the RF values for all analyses of nucleotide and protein sequences are shown respectively in Supplementary Fig. S7 and S8. Interestingly, we note that high G+C proportion (thus low complexity of sequences) plays to the strength of local exact matches, rather than neighbourhood (non-exact) matches as allowed in D2n=1 (Supplementary Fig. S9).

Genetic rearrangement. Here we simulated sequence data to assess the direct impact of genetic rearrangement on the performance of D2 methods in phylogenetic inference. We define R as the percentage length of a full-length nucleotide sequence that has undergone a non-overlapping rearrangement. We simulated post-hoc rearrangements in half of the sequences in a set of 5000-nt sequences, i.e. at N = 8, each of any 4 sequences would have R% of its length rearranged in a non-overlapping manner. Each rearrangement event involves one or more fragments of 250 nt, such that the total rearranged region (i.e. R% of full length) is no longer contiguous (see Methods).
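One possible reading of the post-hoc rearrangement described above is sketched below: R% of a sequence is cut out as non-overlapping 250-nt fragments, and each fragment is reinserted at a random position in the remainder. This is a hypothetical illustration only; the procedure actually used is described in the Methods, and the function name, the block-aligned choice of fragments and the random reinsertion are all assumptions.

```python
import random


def rearrange(seq, r_percent, frag_len=250, rng=None):
    """Move R% of a sequence, as non-overlapping fragments, to random new positions."""
    if rng is None:
        rng = random.Random()
    n_frags = round(len(seq) * r_percent / 100 / frag_len)
    # Assumption: fragments are taken from a fixed grid of frag_len-sized blocks,
    # which guarantees that they do not overlap.
    blocks = [seq[i:i + frag_len] for i in range(0, len(seq), frag_len)]
    chosen = set(rng.sample(range(len(blocks)), n_frags))
    moved = [blocks[i] for i in sorted(chosen)]
    kept = [b for i, b in enumerate(blocks) if i not in chosen]
    for frag in moved:
        # A fragment may land back at its original position by chance;
        # a stricter simulation would reject such draws.
        kept.insert(rng.randrange(len(kept) + 1), frag)
    return "".join(kept)


rng = random.Random(42)  # seeded only to make the example repeatable
original = "".join(rng.choice("ACGT") for _ in range(5000))
shuffled = rearrange(original, r_percent=25, rng=rng)  # 25% of 5000 nt = five 250-nt fragments
print(len(shuffled), shuffled != original)
```

The rearranged sequence keeps the length and residue composition of the original, so any change in the inferred tree reflects the loss of positional order rather than a change in content.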

Figure 3a shows the average RFD2n1 for each k-mer length in nucleotide sequence sets (N = 8) across R = 10, 25 and 50%, including RFMSA of the MSA-based approach MUSCLE+MrBayes. Across all categories and all k-mer sizes, all methods, alignment-free or not, yielded average RF < 0.05 compared to the reference tree. D2n=1 at k = 8 or 12 perfectly recovered the reference topologies (RFD2n1 = 0 in both cases) regardless of R. Figure 3b shows the mean Q values for each of these cases. At R = 10% and 25%, we observed Q = 0 for k = 8 and 12, i.e. D2n=1 performed as well as did the MSA-based approach in recovering the reference topologies. At R = 50%, the D2 methods yielded higher accuracy than did MUSCLE+MrBayes (Q > 0 for all k-mer lengths). Compared to MUSCLE (Fig. 3), the use of MAFFT resulted in higher RF and Q values (Supplementary Fig. S10), thus lower accuracy (p < 2.2 × 10^-16; see Supplementary Note). Our findings suggest that D2 methods are more robust to the effect of genetic rearrangement than is the standard approach based on MSA.

Figure 3. The accuracy of D2 methods based on genetic rearrangement. RFD2n1 are shown in (a) across different k-mer lengths (k ≥ 8), as well as that of the standard approach (RFMSA), across different R at 10%, 25% and 50%. The corresponding QD2n1 values are shown in (b). Error bars indicate standard deviation from the mean.

Insertions/deletions. To assess the sensitivity of the alignment-free approach to insertions/deletions (indels) we simulated nucleotide sequence sets (N = 32) under tree T4 by incorporating indel events at a predefined rate (r) along the tree branches32, with the inserted/deleted fragment lengths following a Lavalette distribution33,34 (maximum length = 100 nt). Figure 4a shows the RF values obtained using D2S and two MUSCLE-based methods (MrBayes and the popular ML method RAxML35,36) across cases at different values of r; the corresponding Q values for each MSA-based approach are shown in Fig. 4b. At r = 0.1, all approaches recovered the reference topology perfectly (RF = 0 in all cases). As r increases, observed RF increases: for trees generated using D2S at r = 0.3, 0.4 and 0.5, RF = 0.001, 0.005 and 0.024. In comparison, the corresponding RF values for MSA-based methods are higher: RF = 0.071, 0.370 and 0.597 for MUSCLE+MrBayes and RF = 0.087, 0.391 and 0.642 for MUSCLE+RAxML. These results suggest that alignment-free methods are more robust to insertions/deletions (RF < 0.025 at r = 0.5) than MSA-based approaches (RF of approximately 0.60 or higher at r = 0.5 in both cases), with all observed Q ≥ 0 (e.g. Q = 0.07, 0.37 and 0.57 at r = 0.3, 0.4 and 0.5 for MUSCLE+MrBayes; Fig. 4b).

Figure 4. The accuracy of phylogenetic approaches based on insertions/deletions. RF values are shown in (a) for D2S (k = 8), MUSCLE+MrBayes and MUSCLE+RAxML across different indel rates r. The corresponding Q values for MUSCLE+MrBayes and MUSCLE+RAxML are shown in (b). Error bars indicate standard deviation from the mean.

Here the use of MAFFT instead of MUSCLE yielded lower RF and Q values, i.e. a higher accuracy of phylogenetic inference (Supplementary Fig. S11 versus Fig. 4; p < 2.2 × 10^-16). These findings are consistent with our analysis of other insertion/deletion scenarios including vertically staggered deletions (Supplementary Note and Fig. S12), a (biologically not very realistic) scenario in which MSA is known to perform poorly37. Independently, we observed that the accuracy of D2 methods decreases with increasing extent of sequence truncation, and increases proportionately with sequence length (Supplementary Note and Fig. S13).

Gene family evolution based on coalescence. Here we simulated nucleotide sequence sets under the coalescent model of gene family evolution (within a population)38,39 across different fixed effective population sizes Ne (see Methods). The coalescent rate between two lineages is higher within a smaller population40, thus a smaller Ne tends to yield shorter branch lengths in a tree. All trees are asymmetric, and thus represent a more-realistic biological scenario. We note that the observed performance in this part of our analysis could be affected by one or more scenarios in addition to Ne (and sequence divergence), e.g. multiple hits per site, haploid genomes or structure within the populations. Figure 5a shows the RF values obtained using D2n=1, and by MSA-based approaches using MUSCLE, across cases at varied Ne; the corresponding Q values for each MSA-based approach are shown in Fig. 5b. RF > 0 was observed across all cases, suggesting that all approaches on average failed to recover known tree topologies perfectly. Observed RF values for all approaches increase with increasing Ne when Ne ≥ 100000, e.g. for D2n=1, RF = 0.072, 0.119, 0.239 and 0.407 at Ne = 100000, 250000, 500000 and 1000000 (Fig. 5a), suggesting an inverse relationship between Ne and the accuracies of these approaches in recovering the known tree topology. At Ne = 10000, 100000 and 250000, both D2n=1 and MSA-based approaches yielded almost identical trees (e.g. Q = -0.007, -0.010, -0.016 against MUSCLE+RAxML; Fig. 5b), although D2n=1 yielded less-accurate topologies (Q < 0). In the extreme cases of Ne > 250000, D2n=1 performed substantially worse than either of the two MSA-based methods, e.g. Q = -0.146 and -0.279 for MUSCLE+RAxML (Fig. 5b). At the other end of the spectrum, cases of small Ne = 1000 also negatively impacted the accuracies of all approaches, i.e. RF = 0.240, 0.230 and 0.213 for D2n=1, MUSCLE+MrBayes and MUSCLE+RAxML respectively (Fig. 5a). Results of the corresponding analysis using MAFFT are shown in Supplementary Figure S14 (p = 0.74; no significant difference). These findings indicate that in these scenarios, the alignment-free approach yields results similar to those of the MSA-based approaches, regardless of which MSA tool is used, when Ne is reasonably large, but performs substantially worse in extreme cases i.e. when Ne is very small or very large. This observation is plausibly explained by extreme (high/low) sequence divergence (see Supplementary Table S1), although we cannot rule out the impact of other evolutionary scenarios. In an independent analysis across datasets that were simulated under non-ultrametric trees (specifically violating the molecular clock) we observed a similar trend (RF > 0; Q < 0), with higher RF observed for D2n=1 than for MSA-based approaches (Supplementary Fig. S15). This complex scenario is more realistic than ultrametric trees, but we cannot distinguish the effect of clock violation from that of other evolutionary processes.

Figure 5. The accuracy of phylogenetic approaches based on coalescent evolution of gene families. RF values are shown in (a) for D2n=1 (k = 10), MUSCLE+MrBayes and MUSCLE+RAxML across different effective population size Ne (1K, 10K, 100K, 250K, 500K and 1000K). The corresponding Q values for MUSCLE+MrBayes and MUSCLE+RAxML are shown in (b). Error bars indicate standard deviation from the mean.

3.2.3.2 Empirical data

To examine the performance of these methods with empirical data, we used 4156 sets of nucleotide sequences and their corresponding phylogenetic trees from TreeBASE (treebase.org)41. These sequence sets and trees were obtained from 2471 studies deposited in TreeBASE as of 27 May 2013 (see Supplementary Data for the complete list). As shown in Supplementary Fig. S16, the sizes of these sequence sets range between 6 and 2957 sequences (mean 59.41, median 41 sequences), and within-set sequence similarity has a mean of 90.12% (median 92.37%). For each sequence set, we used each of the D2 methods (independently for k = 6 and 8) to generate a distance matrix, from which we reconstructed an NJ tree. The selection of k is based on our observation of an optimal length in the analysis of simulated nucleotide sequence sets (Supplementary Fig. S5). Because the true reference tree is unknown for empirical datasets, we cannot readily assess accuracy. Here we compare each of our resulting test trees inferred using the D2 methods against the corresponding tree published (and peer-reviewed) in TreeBASE. Because we cannot assume that published trees perfectly reflect true evolutionary relationships, we intentionally do not interpret RF as a measure of accuracy here, but instead simply as a measure of (dis)agreement between the trees produced by an alignment-free and an MSA-based approach.

As shown in Supplementary Table S2, the use of k = 6 versus 8 does not impact RF for any D2 method, with D2S yielding the smallest average RF (0.438; median 0.409 at k = 8). Figure 6 shows the distribution density of RF as observed for D2S at k = 8, based on sizes of the sequence sets N (Fig. 6a) and within-set sequence similarity (Fig. 6b). See Supplementary Tables S3 and S4 respectively for the corresponding values. As shown in Fig. 6a and Supplementary Table S3, D2S yielded topologies that are more congruent with those generated using the standard MSA approach for small sequence sets (e.g. mean RF 0.363, median 0.333 at N ≤ 25) than for larger sequence sets of N > 25 (e.g. mean RF 0.661, median 0.635 at N > 500), and these RF distances increase with increasing N. Interestingly, across different categories of within-set sequence similarity (percent identity; ID) regardless of N (Fig. 6b), density plots of RF for cases of ID > 70% peak at values of RF between 0.25 and 0.40, with the smallest means observed for highly similar sequence sets (0.424 at ID between 80% and 90%, median 0.392; Supplementary Table S4). RF values increase with decreasing ID, with mean RF 0.533, median 0.528 observed for cases of ID < 70% (Supplementary Table S4). These findings suggest that the D2-based approach, across most of these diverse empirical data, yields topologies that are slightly incongruent (RF < 0.5 in 2809/4156 trees; D2S at k = 8) with those arising from the standard MSA-based approach, and that it is rare for both approaches to recover the exact same tree topology (RF = 0 recovered by any D2-based approach in 106/4156 trees).

Figure 6. The accuracy of D2 methods based on TreeBASE data. The probability density of RFD2n1 at k = 8 as categorised based on (a) total number of sequences within a set, N (mean and median in Table S3), and (b) within-set sequence similarity, ID (mean and median in Table S4).

3.2.3.3 Computational efficiency and scalability

The computational complexity of various D2 methods has been described earlier24 (see also Supplementary Note). Figure 7a shows the computation time required to generate pairwise D2 distance matrices across large empirical sequence sets (N = 1000, 2000, 3000, 4000 and 5000); for the corresponding numerical values see Supplementary Table S5. These large sequence sets are of 16S ribosomal RNA genes sampled from the GreenGenes database (see Methods). Mean computation time increases with N, from 49.77 seconds at N = 1000 to 842.98 seconds at N = 5000 (17-fold increase). Similarly, memory usage (Supplementary Table S5) increases with N, from 378.24 MB (N = 1000) to 2445.31 MB (N = 5000; approximately 6-fold increase).
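The fold increases quoted above can be reproduced from the reported endpoints, and compared with the growth in the number of sequence pairs, N(N-1)/2, that a pairwise distance matrix requires. The sketch below uses only the two endpoint values given in the text (N = 1000 and N = 5000); the apparent scaling exponent it prints is a rough two-point estimate, not a fitted complexity, and the variable names are hypothetical.

```python
import math

# Reported endpoints (see text and Supplementary Table S5).
n_small, n_large = 1000, 5000
time_small, time_large = 49.77, 842.98   # seconds
mem_small, mem_large = 378.24, 2445.31   # MB


def n_pairs(n):
    """Number of pairwise comparisons in an n x n distance matrix."""
    return n * (n - 1) // 2


time_fold = time_large / time_small
mem_fold = mem_large / mem_small
pair_fold = n_pairs(n_large) / n_pairs(n_small)

# Two-point estimate of b in time ~ N^b.
exponent = math.log(time_fold) / math.log(n_large / n_small)

print(f"time:   {time_fold:.2f}-fold")
print(f"memory: {mem_fold:.2f}-fold")
print(f"pairs:  {pair_fold:.2f}-fold")
print(f"apparent exponent: {exponent:.2f}")
```

On these two points, computation time grows more slowly than the number of sequence pairs, which grows about 25-fold over the same range.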

Phylogenetic inference involves details not only of software (e.g. D2 and neighbor in PHYLIP versus MUSCLE and MrBayes) but also of parameter settings, implementation (e.g. programming language used, and capacity for multi-threading) and hardware (e.g. machine architecture and its efficiency of memory usage). Therefore, comparing computation time and memory usage between the two approaches is not straightforward. For 50 sets of nucleotide sequences (N = 8; L = 1500 nt), we observe an average wall time of 1.50, 86.38 and 491.16 seconds for D2+neighbor, MUSCLE+RAxML and MUSCLE+MrBayes respectively (four-threaded runs; see Methods). For the same analysis across protein sequence sets (N = 8; L = 500 aa), wall times are respectively 1.82, 255.48 and 3047.14 seconds. For the protein sets, our alignment-free approach is approximately 140-fold and 1670-fold faster than MUSCLE+RAxML and MUSCLE+MrBayes respectively. These findings suggest that D2 methods are highly scalable for phylogenetic inference of large-scale sequence data.
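The speed-up factors follow directly from the wall times reported above; the short sketch below recomputes them for both sequence types. The dictionary layout and labels are hypothetical and serve only to organise the reported values.

```python
# Average wall times in seconds, as reported in the text (N = 8; four-threaded runs).
wall_time = {
    "nucleotide": {"D2+neighbor": 1.50, "MUSCLE+RAxML": 86.38, "MUSCLE+MrBayes": 491.16},
    "protein": {"D2+neighbor": 1.82, "MUSCLE+RAxML": 255.48, "MUSCLE+MrBayes": 3047.14},
}

# Speed-up of the alignment-free pipeline relative to each MSA-based pipeline.
speedup = {
    seq_type: {
        method: seconds / times["D2+neighbor"]
        for method, seconds in times.items()
        if method != "D2+neighbor"
    }
    for seq_type, times in wall_time.items()
}

for seq_type, ratios in speedup.items():
    for method, ratio in ratios.items():
        print(f"{seq_type:10s} {method:15s} {ratio:7.1f}-fold")
```

The 140-fold and 1670-fold figures correspond to the protein sequence sets; the same calculation on the nucleotide sets gives smaller factors.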

Figure 7. Computation time of D2 methods. The computation time (CPU time in seconds) is shown for (a) the D2 method at k = 8 across subsets of GreenGenes data of N = 1000, 2000, 3000, 4000 and 5000, and for (b) D2n analysis across neighbourhood size n = 1 through 5, for nucleotide sequence sets of N = 8. Error bars indicate standard deviation from the mean.

In an independent experiment on nucleotide sequence sets of N = 8 (Fig. 7b), we found that computation time for D2n (at k = 8) increases exponentially with increasing neighbourhood n, from 0.71 seconds at n = 1 to 124.73 seconds at n = 5. At greater values of neighbourhood (n > 2), i.e. when a higher number of wildcards is considered, the accuracy of D2n appears to decrease, more so at larger N (Supplementary Fig. S17; shown for k = 8 across nucleotide sequence sets). However, the interplay among n, k and N remains to be investigated systematically.
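The steep growth in computation time with n is consistent with the combinatorial growth of a k-mer's neighbourhood. As a rough illustration, and assuming that a neighbourhood of size n corresponds to all words that differ from a given k-mer at up to n positions (the implementation actually used is described in the Supplementary Note and may differ), the number of such words is the sum over i = 0..n of C(k, i) × (a - 1)^i for an alphabet of size a. The function name below is hypothetical.

```python
from math import comb


def neighbourhood_size(k, n, alphabet_size=4):
    """Number of words within n mismatches of a k-mer (including the k-mer itself)."""
    return sum(comb(k, i) * (alphabet_size - 1) ** i for i in range(n + 1))


# Nucleotide k-mers of length 8, as in the experiment described above.
for n in range(1, 6):
    print(n, neighbourhood_size(8, n))
```

For k = 8 over nucleotides, the neighbourhood grows from 25 words at n = 1 to 21067 words at n = 5, so each additional wildcard multiplies the number of candidate matches that may need to be considered.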

3.2.4 Discussion

Alignment-free methods yielded similar if not identical tree topologies to those generated using MSA-based approaches across a wide range of data sizes and scenarios. Our findings demonstrate that the accuracy of alignment-free methods, compared to the current standard based on MSA, is more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but is more sensitive to sequence divergence and the presence of incomplete (truncated) sequence data. The alignment-free methods operated at far greater computation speed (more than 2000 times faster in some cases).

Opposing views have recently been expressed on whether the application of alignment-free methods in phylogenetics reflects a model-free, purely informatic exercise, or alternatively can capture homology signal inherent in evolving sequences42-44. Our results support the latter view. The alignment-free approach implemented here appears to have no difficulty, at appropriate parameter settings across our simulated datasets, in capturing homology signal and generating topologies that are very similar or identical to those generated by MSA followed by Bayesian inference, arguably the current standard in phylogenetics (see below). The robustness of alignment-free methods to rearrangements and insertions/deletions represents a critical advantage, since these events are common among microbial genomes3 and frequently interrupt individual genes45. Our findings support the notion that gappy regions tend to be forced into alignment within an MSA framework and thereby bias subsequent phylogenetic inference37.

Here we used MUSCLE26 and MrBayes27 as the standard phylogenetic approach in the analysis of simulated data. Another popular MSA tool is MAFFT46; both MUSCLE and MAFFT compare favourably against other MSA tools in a number of benchmark studies26,47. A comprehensive analysis of performance across different MSA tools is beyond the scope of this study. Across scenarios of random insertions/deletions, we found little difference in our inference between the use of MUSCLE and MAFFT (p > 0.5; Supplementary Table S6), except under the unrealistic scenarios of vertically staggered deletions (Supplementary Fig. S12; $p < 2.2 \times 10^{-16}$) in which MAFFT performed better, lending support to an earlier report37. The use of other programs for MSA and phylogenetic inference, or indeed the use of different parameter settings in these programs (e.g. fewer MCMC generations in MrBayes than the 1.5 million used in this study), would inevitably yield somewhat different results. ML is another popular MSA-based method of phylogenetic inference, which estimates goodness-of-fit of sequence data given an underlying evolutionary (substitution) model. ML methods e.g. RAxML35 are time-consuming, and this has prompted the development of faster though less-accurate implementations e.g. PhyML21 and/or scalable methods that approximate ML estimates e.g. FastTree22 (see ref. 48 for a comparative analysis). We generated ML trees for a subset of the simulated sequence data using RAxML and found little or no topological difference between these trees and those generated using MrBayes, as shown by the similar trends of RF and Q in Figs 4 and 5. In fact, RAxML yielded less-accurate topologies than MrBayes in many cases (larger RF observed for RAxML: Fig. 5).

Using extensive simulated data and diverse empirical data (here from the TreeBASE dataset, generated by various programs and phylogenetic inference methods common in the peer-reviewed literature), our results consistently demonstrate the relative accuracy and scalability of alignment-free methods in large-scale phylogenetic inference, regardless of which specific method they were compared against. The empirical datasets used in this study are highly diverse, with various extents of within-set sequence divergence and data sizes. Many of these sequence sets contain partial and/or fragmented sequences (Supplementary Data). As per our analysis of simulated sequence sets, these aspects impact the accuracy of alignment-free methods more than that of MSA-based approaches in recovering accurate phylogenies. In addition, we applied k = 6 and 8 in our alignment-free approach across these datasets, a decision based on our observations in simulated sets of 1500-nt sequences (Supplementary Fig. S5). In cases where sequences are longer, the representation of distinct k-mers (at k = 6 or 8) could be saturated, thus losing the resolution (reducing the distinguishing power of the k-mers) necessary to accurately infer dissimilarity (vis-à-vis phylogenetic) relationships among the sequences9,30. The correlation between sequence length and k within the context of phylogenetics has been explored to some extent30,49, e.g. using shortest unique substrings50, but this issue remains to be systematically investigated. In this study we used NJ to infer phylogenetic trees from the distance matrices generated from $D_2$ methods; one can imagine using other distance-based approaches, e.g. a weighted least-squares method such as Fitch-Margoliash51. In small-scale investigations, we find no topological difference across trees generated using NJ or Fitch-Margoliash.
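The saturation argument above can be made concrete with a rough calculation. The sketch below estimates the expected fraction of all possible k-mers that appear at least once in a sequence of a given length, assuming a uniformly random sequence and treating overlapping k-mer positions as independent; both are simplifying assumptions, and the function name `expected_kmer_coverage` is hypothetical.

```python
def expected_kmer_coverage(seq_len: int, k: int, alphabet: int = 4) -> float:
    """Approximate expected fraction of all possible k-mers observed at least
    once in a uniformly random sequence of length seq_len.

    Overlapping positions are treated as independent draws, so this is an
    approximation rather than an exact expectation.
    """
    n_positions = seq_len - k + 1
    if n_positions <= 0:
        return 0.0
    n_possible = alphabet ** k
    return 1.0 - (1.0 - 1.0 / n_possible) ** n_positions


# Compare a gene-length sequence with a genome-length sequence.
for length in (1500, 5_000_000):
    for k in (6, 8):
        print(length, k, round(expected_kmer_coverage(length, k), 4))
```

Under these assumptions a 1500-nt sequence covers only a minority of the possible 6-mers or 8-mers, whereas a sequence of several megabases is expected to contain essentially all of them, at which point presence or absence of a k-mer carries little distinguishing information.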

Conversion of subsequence similarity (profile) scores into a measure that represents the evolutionary relatedness between two full-length sequences remains an active field of research. Here we simply transformed $D_2$ scores into pairwise distances of sequences using a logarithmic representation of the geometric mean. Other strategies have been proposed to create more-realistic measures of distance or dissimilarity, including the assignment of a p-value for each pairwise score based on a null distribution (hypothesis) of subsequences as observed across the whole dataset29,52. Approaches inspired by information retrieval are under consideration.

In general, our results demonstrate the utility and robustness of alignment-free methods across the choice of scoring methods. The non-monotonic relationship between word length and performance, the utility of $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and the failure of larger mismatch neighbourhoods are broadly consistent with previous reports18,52. However, simple $D_2$ scoring is known to be dominated by single-sequence noise effects as k increases18; its good performance here may in part be explained by the normalisation inherent in our distance measure. The one exception to these comments lies in the vulnerability of $D_2^*$-based approaches to heterogeneous variation, an effect especially pronounced for protein sequences (Supplementary Fig. S6), which may arise from the failure of the variance estimate in the denominator.

Crucially, the computational advantages identified above extend to a broad range of scoring methods and distance transformations. The use of a mismatch neighbourhood has potential to add significantly to both the compute and memory requirements of the process, but these demands are modest for $D_2^{n=1}$, and larger neighbourhoods seem not to improve its performance in phylogenetic inference. Alignment-free methods thus offer computational speed many hundreds or thousands of times faster than the comparable MSA-based approaches, with memory requirements in the hundreds of megabytes, well within the capabilities of even portable commodity devices. To the extent that memory is not an issue, alignment-free methods present an attractive, highly scalable alternative to MSA-based methods in large-scale phylogenetic (and phylogenomic) analyses.

3.2.5 Methods

Simulated sequence data. For all programs, default settings were used unless otherwise specified. We simulated sets of DNA and protein sequences of different sizes (N = 8, 16, 32, 128) using evolver as implemented in PAML 4.5 (ref. 53), unless otherwise specified. We used GTR54 (rate parameters a=0.987, b=0.11, c=0.218, d=0.243, e=0.395)55 and WAG56 substitution models respectively for simulation of nucleotide and protein sequences. We detail the simulation strategy for each evolutionary scenario below.

Sequence divergence. For each set, sequences of fixed length (L = 1500 nt for DNA; 500 amino acids for protein) were simulated on an unrooted symmetrical tree on which the lengths of internal (x) and terminal (y, or y1 and y2) branches are set separately, at either 0.01 or 0.05 substitutions per site, to represent six distinct scenarios (Fig. 1; shown for 8-taxon trees). These sequence sets were simulated under a discrete approximation of the gamma distribution (shape parameter α = 1.0, 8 categories).

Genetic rearrangement. For each nucleotide sequence set (N = 8; L = 5000 nt), we relocated one or more regions (i.e. individual rearrangement events) of 250 nt within a sequence in a cut-and-paste manner, with no overlaps. We define R as the total percentage length of L that has been relocated. We simulated sequence sets with R = 5, 10, 25 and 50% (each in 50 replicates), such that the total rearranged region is not contiguous. Given the prior expectation that alignment-free methods would be less sensitive to sequence rearrangements, here we simulated sequence sets under tree T3 (Fig. 1), one of the more problematic cases for $D_2$ methods (as shown in Fig. 2).
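The cut-and-paste procedure described above can be sketched as follows. This is an illustrative simplification rather than the script used in the study: blocks are chosen on a fixed 250-nt grid so that they cannot overlap, and each block is pasted back at a uniformly random position, which means a block could occasionally land inside a previously pasted block or next to its original neighbours. The function name `rearrange` and its parameters are hypothetical.

```python
import random


def rearrange(seq: str, percent: float, block: int = 250, rng=None) -> str:
    """Relocate non-overlapping blocks totalling `percent` % of the sequence."""
    rng = rng or random.Random()
    n_blocks = round(len(seq) * percent / 100 / block)
    n_slots = len(seq) // block
    if n_blocks > n_slots:
        raise ValueError("percent is too large for this sequence length")
    # Choose block starts on a fixed grid so that chosen blocks never overlap.
    starts = [s * block for s in sorted(rng.sample(range(n_slots), n_blocks))]
    cut = [seq[s:s + block] for s in starts]
    # Remove the chosen blocks from right to left so earlier offsets stay valid.
    remainder = seq
    for s in reversed(starts):
        remainder = remainder[:s] + remainder[s + block:]
    # Paste each block back at a random position in the growing sequence.
    for piece in cut:
        pos = rng.randint(0, len(remainder))
        remainder = remainder[:pos] + piece + remainder[pos:]
    return remainder


rng = random.Random(1)
original = "".join(rng.choice("ACGT") for _ in range(5000))
shuffled = rearrange(original, percent=10, rng=rng)
print(len(original), len(shuffled))
```

Because blocks are only moved, never edited, sequence length and base composition are preserved, so any change in an inferred tree reflects loss of contiguity rather than substitution.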

Insertions/deletions. For this analysis, we simulated nucleotide sequence sets of size N = 32 (L = 1500 nt) using INDELible32 under tree T4 (Fig. 1), a discrete approximation of the gamma distribution (α = 1.0, 8 categories) and GTR model. Indel rates were set at 0.1, 0.2, 0.3, 0.4 and 0.5, with insertion rate = deletion rate; these rates are relative to site substitution rate of 1. Length distribution of inserted/deleted fragments follows a Lavalette distribution33,34 (a=1.1; maximum indel size 100 nt) as implemented in INDELible32.

Coalescent model of gene family evolution. We used NetRecodon57 to simulate gene family evolution under the coalescent model along a tree, each case at a defined effective population size (Ne) of 1000, 10000, 100000, 250000 and 500000, with a discrete approximation of the gamma distribution (α = 0.5, 8 categories), GTR model and mutation rate u = $10^{-5}$. Sequence sets of size N = 32 (L = 1500 nt) were used. Larger Ne values result in longer branch lengths on a tree (see Supplementary Table S1). To simulate violation of the molecular clock, relaxed branch lengths were further simulated on these trees using BranchRelaxer in GenPhyloData58, with substitution rates along branches modelled as independent and identically distributed variables on a log-normal scale (IIDLogNormal model: mean 0.0, variance 1.0)59. Sequences were then simulated using evolver along these new trees as per above.

Empirical sequence data. All 2471 nucleotide datasets in NEXUS format were downloaded from TreeBASE (treebase.org as of 27 May 2013)41 using a custom script kindly provided by Dr William Piel. For each dataset, one or more nucleotide sequence alignment and their corresponding phylogenetic trees (totalling 4156) were extracted (Supplementary Data). All 406997 unaligned 16S ribosomal RNA gene sequences (sequences_16S_all_gg_2011_1_unaligned.fasta.gz)60 were downloaded from the GreenGenes database (secondgenome.com/go/2011-greengenes-taxonomy).

To assess scalability of $D_2$ methods on different sizes of sequence sets, we randomly sampled these 406997 sequences into sets of size N = 1000, 2000, 3000, 4000 and 5000, each in 100 replicates. We follow ref. 61 in defining within-set sequence similarity as the average pairwise similarity between each sequence in a set and the centroid sequence. A centroid sequence within a set is one that yielded the single highest bit score across all pairwise comparisons within the set using BLAST (e < $10^{-3}$).

Alignment-free phylogenetic approach. For each sequence set, we used each of the $D_2$ statistics ($D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$) independently to generate a score for each possible pair of sequences within a set (see Supplementary Note for details). These scores were transformed via logarithmic representation of the geometric mean to generate a distance. The pairwise distance between sequences a and b, $D_{ab}$, is defined as

$$D_{ab} = \left| \ln \left( \frac{S_{ab}}{\sqrt{S_{aa} \times S_{bb}}} \right) \right|$$

where $S_{ab}$ is the pairwise score between them, and $S_{aa}$ and $S_{bb}$ are the self-matching scores. These transformed pairwise distances closely approximate the angle-based distances in an earlier alignment-free method for inferring protein phylogenies62. The resulting distance matrix was used to reconstruct a phylogenetic tree using neighbor in PHYLIP v3.69 (evolution.genetics.washington.edu/phylip). Generation of the distance matrix from any of these $D_2$ methods is implemented in a Java program, JIWA, which is freely available at bioinformatics.org.au/tools-data/.
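As a rough illustration of the score-to-distance transformation described above, the sketch below computes the plain $D_2$ score (the sum, over shared k-mers, of the product of their counts in the two sequences) and converts it to a distance. It assumes the transformation is the absolute value of the logarithm of the pairwise score divided by the geometric mean of the two self-scores; it covers only plain $D_2$, not $D_2^S$, $D_2^*$ or the neighbourhood variant, and it is not the JIWA implementation. All function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(x: Counter, y: Counter) -> int:
    """Plain D2: sum over shared k-mers of the product of their counts."""
    return sum(count * y[word] for word, count in x.items() if word in y)


def d2_distance(a: str, b: str, k: int) -> float:
    """|ln(S_ab / sqrt(S_aa * S_bb))|, assumed form of the transformation."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    s_ab = d2_score(ca, cb)
    if s_ab == 0:
        # No shared k-mers: the logarithm is undefined, so report infinity.
        return float("inf")
    return abs(log(s_ab / sqrt(d2_score(ca, ca) * d2_score(cb, cb))))


print(d2_distance("ACGTACGT", "ACGTTCGT", k=3))
```

Normalising by the geometric mean of the self-scores makes the distance zero for identical k-mer profiles and keeps it symmetric in a and b, which is what a distance-based method such as neighbour-joining expects.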

Standard phylogenetic approach using multiple sequence alignment. For each sequence set, we used MUSCLE v3.8.3 (ref. 26) to generate a multiple sequence alignment. For scenarios of genetic rearrangement, insertions/deletions and the coalescent model, we also used MAFFT (mafft-linsi) v7.158b (ref. 46). For other simulated scenarios, the true alignments were known from the simulation process, so the use of any MSA tool would not change the final alignments. For Bayesian phylogenetic inference, we used MrBayes v3.2.1 (ref. 27) (MCMC ngen=1500000 generations, samplefreq=100, burn-in=10000 samples, temp=0.5, nchains=4; sumt contype=allcompat). We assumed the general reversible substitution model (lset Nucmodel=4by4 Nst=6) and a mixed amino acid substitution model (prset aamodel=mixed) respectively for nucleotide and protein sequences, under a four-category discrete gamma distribution across all runs (lset rate=gamma ngammacat=4). In all cases except the insertions/deletions analysis, the standard deviation of split frequencies was < 0.01 after 200000 generations. For the insertions/deletions analysis, MrBayes was run with a larger number of MCMC generations (ngen=5000000) and a larger burn-in (samplefreq=100, burn-in=25000 samples), while other parameters remained the same. The standard deviation of split frequencies in most cases was < 0.01 after 1000000 generations. For maximum likelihood inference of phylogenetic trees, we used RAxML v8.0.2 (ref. 36) (-# 100, -t 4, -m GTRGAMMA or PROTGAMMAWAG respectively for nucleotide and protein sequences).

Assessment of accuracy. For each tree generated from a sequence set using $D_2$ statistics or the standard approach, we compared its topological congruence to a reference tree using the Robinson-Foulds distance28, as implemented in treedist in PHYLIP v3.69. This distance represents the number of splits (i.e. bipartitions) that are present in only one of the two trees. To facilitate comparison of our results across trees (i.e. sequence sets) of various sizes N, we normalised the distances by the maximum possible distance between two unrooted trees, 2(N - 3), following ref. 63. Here we denote RF as the normalised Robinson-Foulds distance, with a value between 0 and 1 that can be interpreted as the proportion of false or missing bipartitions in the test tree topology compared to the reference topology63. When RF = 0, the test and reference topologies are identical, suggesting high accuracy of the approach. When RF = 1, none of the bipartitions in the reference is recovered in the test. In these cases, the trees could have been generated at random, as a pair of randomly generated tree topologies of N taxa has a Robinson-Foulds distance that approximates the denominator for normalisation, 2(N - 3) (ref. 64). For the simulated data, we used the known tree (under which the sequences were simulated) as the reference. For empirical data from TreeBASE we used the published tree in the database as reference; in these cases, a zero RF does not relate directly to accuracy, but rather reflects the extent to which our method recovers the same topology as the published method based on multiple sequence alignment.
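The normalised Robinson-Foulds distance described above can be sketched in a few lines. This is an illustration, not the treedist implementation: trees are represented as nested tuples of leaf names (an assumption made here for brevity), each internal edge is reduced to the non-trivial bipartition it induces, and the symmetric difference of the two bipartition sets is divided by 2(N - 3). All function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of a tree given as nested tuples of leaf names."""
    leaves = leaf_set(tree)
    anchor = min(leaves)  # fixed leaf used to give each split one canonical form
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that excludes the anchor leaf.
        side = leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def normalised_rf(tree_a, tree_b):
    """Symmetric difference of bipartitions divided by 2(N - 3)."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    if len(leaves) < 4:
        raise ValueError("at least four leaves are needed")
    diff = bipartitions(tree_a) ^ bipartitions(tree_b)
    return len(diff) / (2 * (len(leaves) - 3))


reference = ((("A", "B"), "C"), ("D", "E"))
test_tree = ((("A", "C"), "B"), ("D", "E"))
print(normalised_rf(reference, test_tree))
```

Storing each bipartition as the side that excludes a fixed anchor leaf means the same split is recorded identically regardless of where the nested-tuple representation happens to be rooted.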

Assessment of computational scalability and runtime. The assessment of computational scalability was carried out using a high-performance distributed-memory computing cluster based on Intel Sandy Bridge 8-core 2.6 GHz processors. Comparative runtime analysis of alignment-free and MSA-based phylogenetic approaches was done on Intel Xeon L5520 8-core 2.26 GHz processors (multi-threaded, four threads). MCMC ngen=1500000 was used for MrBayes runs.

3.2.6 References

1 Edgar, R.C. & Batzoglou, S. Multiple sequence alignment. Curr. Opin. Struct. Biol. 16, 368-373 (2006).

2 Notredame, C. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput. Biol. 3, 1405-1408 (2007).

3 Darling, A.E., Miklos, I. & Ragan, M.A. Dynamics of genome rearrangement in bacterial populations. PLoS Genet. 4, e1000128 (2008).

4 Puigbò, P., Wolf, Y.I. & Koonin, E.V. The tree and net components of prokaryote evolution. Genome Biol. Evol. 2, 745-756 (2010).

5 Zhaxybayeva, O. & Doolittle, W.F. Lateral gene transfer. Curr. Biol. 21, R242-246 (2011).

6 Wong, K.M., Suchard, M.A. & Huelsenbeck, J.P. Alignment uncertainty and genomic analysis. Science 319, 473-476 (2008).

7 Wu, M.T., Chatterji, S. & Eisen, J.A. Accounting for alignment uncertainty in phylogenomics. PLoS ONE 7, e30288 (2012).

8 Chan, C.X. & Ragan, M.A. Next-generation phylogenomics. Biol. Direct 8, 3 (2013).

9 Höhl, M. & Ragan, M.A. Is multiple-sequence alignment required for accurate inference of phylogeny? Syst. Biol. 56, 206-221 (2007).

10 Höhl, M., Rigoutsos, I. & Ragan, M.A. Pattern-based phylogenetic distance estimation and tree reconstruction. Evol Bioinform Online 2, 359-375 (2006).

11 Domazet-Lošo, M. & Haubold, B. Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics 27, 1466-1472 (2011).

12 Vinga, S. & Almeida, J. Alignment-free sequence comparison - a review. Bioinformatics 19, 513-523 (2003).

13 Bonham-Carter, O., Steele, J. & Bastola, D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief. Bioinform., In Press, DOI:10.1093/bib/bbt1052 (2013).

14 Haubold, B. Alignment-free phylogenetics and population genetics. Brief. Bioinform. 15, 407-418 (2014).

15 Song, K. et al. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief. Bioinform. 15, 343-353 (2014).

16 Torney, D.C., Burks, C., Davison, D. & Sirotkin, K.M. in Computers and DNA - Santa Fe Institute Studies in the Sciences of Complexity Vol 7. (eds. G. Bell & R. Marr) 109-125 (Addison-Wesley, Reading, MA; 1990).

17 Wan, L., Reinert, G., Sun, F. & Waterman, M.S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput. Biol. 17, 1467-1490 (2010).

18 Reinert, G., Chew, D., Sun, F. & Waterman, M.S. Alignment-free sequence comparison (I): statistics and power. J Comput. Biol. 16, 1615-1634 (2009).

19 Hide, W., Burke, J. & Davison, D.B. Biological evaluation of d2, an algorithm for high-performance sequence comparison. J Comput. Biol. 1, 199-215 (1994).

20 Miller, R.T. et al. A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 9, 1143-1155 (1999).

21 Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307-321 (2010).

22 Price, M.N., Dehal, P.S. & Arkin, A.P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

23 Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).

24 Göke, J., Schulz, M.H., Lasserre, J. & Vingron, M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics 28, 656-663 (2012).

25 Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 41, e75 (2013).

26 Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792-1797 (2004).

27 Ronquist, F. et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539-542 (2012).

28 Robinson, D.F. & Foulds, L.R. Comparison of phylogenetic trees. Math. Biosci. 53, 131-147 (1981).

29 Forêt, S., Wilson, S.R. & Burden, C.J. Empirical distribution of k-word matches in biological sequences. Pattern Recognit. 42, 539-548 (2009).

30 Forêt, S., Kantorovitz, M.R. & Burden, C.J. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformatics 7 Suppl 5, S21 (2006).

31 Huffman, D.A. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 1098-1101 (1952).

32 Fletcher, W. & Yang, Z. INDELible: a flexible simulator of biological sequence evolution. Mol. Biol. Evol. 26, 1879-1888 (2009).

33 Lavalette, D. Facteur d’impact: impartialité ou impuissance? (INSERM U350 Institut Curie - Recherche Bât. 112, Centre Universitaire, Orsay, France; 1996).

34 Popescu, I.I. On a Zipf's Law extension to impact factors. Glottometrics 6, 83-93 (2003).

35 Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-2690 (2006).

36 Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312-1313 (2014).

37 Golubchik, T., Wise, M.J., Easteal, S. & Jermiin, L.S. Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Mol. Biol. Evol. 24, 2433-2442 (2007).

38 Kingman, J.F.C. The coalescent. Stoch. Proc. Appl. 13, 235-248 (1982).

39 Tellier, A. & Lemaire, C. Coalescence 2.0: a multiple branching of recent theoretical developments and their applications. Mol. Ecol. 23, 2637-2652 (2014).

40 Sjödin, P., Kaj, I., Krone, S., Lascoux, M. & Nordborg, M. On the meaning and existence of an effective population size. Genetics 169, 1061-1070 (2005).

41 Piel, W.H., Donoghue, M.J. & Sanderson, M.J. in To the interoperable "Catalog of Life" with partners Species 2000 Asia Oceania. NIES Research Report., Vol. 171. (eds. J. Shimura, K.L. Wilson & D. Gordon) 41-47 (National Institute for Environmental Studies, Tsukuba, Japan; 2002).

42 Posada, D. Phylogenetic models of molecular evolution: next-generation data, fit, and performance. J. Mol. Evol. 76, 351-352 (2013).

43 Ragan, M.A. & Chan, C.X. Biological intuition in alignment-free methods: response to Posada. J. Mol. Evol. 77, 1-2 (2013).

44 Ragan, M.A., Bernard, G. & Chan, C.X. Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra. RNA Biol. 11, 176-185 (2014).

45 Chan, C.X., Darling, A.E., Beiko, R.G. & Ragan, M.A. Are protein domains modules of lateral genetic transfer? PLoS ONE 4, e4524 (2009).

46 Katoh, K. & Standley, D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772-780 (2013).

47 Thompson, J.D., Linard, B., Lecompte, O. & Poch, O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS ONE 6, e18093 (2011).

48 Liu, K., Linder, C.R. & Warnow, T. RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 6, e27731 (2011).

49 Gunasinghe, U., Alahakoon, D. & Bedingfield, S. Extraction of high quality k-words for alignment-free sequence comparison. J. Theor. Biol. 358, 31-51 (2014).

50 Haubold, B. & Pfaffelhuber, P. Alignment-free population genomics: an efficient estimator of sequence diversity. G3 2, 883-889 (2012).

51 Fitch, W.M. & Margoliash, E. Construction of phylogenetic trees. Science 155, 279-284 (1967).

52 Burden, C.J., Kantorovitz, M.R. & Wilson, S.R. Approximate word matches between two random sequences. Ann. Appl. Probab. 18, 1-21 (2008).

53 Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586-1591 (2007).

54 Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17, 57-86 (1986).

55 Yang, Z. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39, 105-111 (1994).

56 Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691-699 (2001).

57 Arenas, M. & Posada, D. Coalescent simulation of intracodon recombination. Genetics 184, 429-437 (2010).

58 Sjöstrand, J., Arvestad, L., Lagergren, J. & Sennblad, B. GenPhyloData: realistic simulation of gene family evolution. BMC Bioinformatics 14, 209 (2013).

59 Drummond, A.J., Ho, S.Y., Phillips, M.J. & Rambaut, A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 4, e88 (2006).

60 McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 610-618 (2012).

61 Chan, C.X., Mahbob, M. & Ragan, M.A. Clustering evolving proteins into homologous families. BMC Bioinformatics 14, 120 (2013).

62 Stuart, G.W., Moffett, K. & Baker, S. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics 18, 100-108 (2002).

63 Kupczok, A., Schmidt, H. & von Haeseler, A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol. Biol. 5, 37 (2010).

64 Bryant, D. & Steel, M. Computing the distribution of a tree metric. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 420-426 (2009).

3.3 Concluding remarks

This chapter demonstrates the accuracy and scalability of AF approaches to infer phylogenies of evolving sequences. Compared to an MSA approach, my results demonstrate that AF methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, these AF methods perform well even when applied to bacterial datasets, while preserving great computational speed. These performances suggest that AF approaches could be used to infer phylogenies from whole-genome data.

CHAPTER 4: CAN ALIGNMENT-FREE APPROACHES ACCURATELY BE USED TO INFER PHYLOGENIES OF MICROBIAL GENOMES?

In Chapter 3, I demonstrated that AF approaches can infer accurate phylogenies of molecular sequences at high computational efficiency. However, inferring phylogenies from whole-genome sequences is more complicated due to common evolutionary scenarios that violate sequence contiguity, such as LGT and genome rearrangement. Despite the growing number of AF methods available, their performance accuracy in inferring phylogenies from whole-genome sequences remains little known.

This chapter is presented in the form of a manuscript, addressing Aims 2 and 3 (Chapter 1, section 1.5). This manuscript presents a systematic investigation, using both simulated and empirical sequence data, of the sensitivity and scalability of nine AF methods to genome-scale evolutionary events, specifically genome divergence, LGT and rearrangement. These methods represent the two families of AF methods, those based on word-counts (with exact or inexact k-mers), and those based on match-lengths (with or without mismatches). This work was published in Scientific Reports (doi: 10.1038/srep28970). As the first author of this paper, I designed and conducted all experiments, performed all the associated analyses, and prepared the manuscript text, including all associated figures and tables. The supplementary material for this manuscript is presented in Appendix B.

4.1 Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer

4.1.1 Abstract

Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.

4.1.2 Introduction

Phylogenomics is the study of evolutionary relationships via comparative analysis of genome-scale data, intersecting the fields of evolution and genomics. Phylogenomic methods are used to assess diverse biological hypotheses, e.g. the emergence and spread of antibiotic resistance among bacterial pathogens1, or the tree of life2. Evolutionary relationships are inferred based on sequence homology in a comparative gene-by-gene3, concatenated multi-gene4 or whole-genome approach5. Classical phylogenomic approaches consist of three main steps: clustering of homologous sequences, multiple sequence alignment (MSA) of each cluster, and inference of a phylogenetic tree based on the alignment, e.g. using maximum likelihood (ML) or Bayesian methods. MSA is a crucial step in these approaches. In carrying out MSA we implicitly assume that by inserting gaps and sliding blocks of sequence relative to each other, we can generate a position-wise hypothesis of homology across the entire length of the sequences. In reality, this may be unrealistic because genes and genomes are subject to recombination, inversion, rearrangement and lateral genetic transfer6.

These processes are intensified in most microbial genomes, where LGT, inversions and rearrangement are rampant5,7,8. These events violate models of nucleotide substitution across sequence positions and lineages, commonly specified in MSA-based approaches, and thus bias subsequent phylogenetic inference9,10. In addition, increasing numbers of microbial genomes are now available, both across a broad taxonomic range11,12 and within individual taxa13. MSA does not scale well with genome-scale data, such that classical approaches will soon be impractical for large-scale comparative genome analysis.

Alternatively, alignment-free (AF) methods, based on comparison of sub-sequences of defined length (known variously as k-mers or n-grams) instead of full-length sequences, can be used14-16. AF methods utilise subsequences (k-mers) as the basis for calculating an overall pairwise distance between one sequence and another. These methods (Figure 1) fall into two broad categories15: those based on the number or frequency of k-mers shared between two sequences, and those based on the length of matching k-mers. Furthermore the k-mers may or may not be required to match exactly. In general, match-length methods perform well in the comparison of highly similar sequences, due to the large proportion of exact matches. Algorithmically, most AF methods show linear-time complexity, but word-count methods are usually faster and, depending on the implementation, can require less memory than match-length methods.

[Figure 1: tree diagram of AF methods. Word count: exact (ffp, cvt) and inexact (co-phylog, spaced). Match length: LZ-factors (gram) and common substrings, the latter exact (acs, kr) or inexact (kmacs).]

Figure 1: Alignment-free methods classification. Classification of alignment-free methods, modified following Haubold15. LZ: Lempel-Ziv.
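The two categories described above can be contrasted with a toy example. The sketch below is not any of the nine published methods: the word-count side is represented by a simple Jaccard index over exact k-mer sets, and the match-length side by a naive search for the longest substring starting at each position of one sequence that also occurs in the other (the quantity that average-common-substring style methods build on). The function names are hypothetical and the naive search is quadratic or worse, unlike the suffix-structure implementations used in practice.

```python
def shared_kmer_fraction(a: str, b: str, k: int) -> float:
    """Word-count view: Jaccard index of the exact k-mer sets of a and b."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


def longest_match_lengths(query: str, subject: str) -> list:
    """Match-length view: for each start position in query, the length of the
    longest substring beginning there that also occurs in subject."""
    lengths = []
    for i in range(len(query)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(query) and query[i:i + length + 1] in subject:
            length += 1
        lengths.append(length)
    return lengths


a, b = "ACGTACGTTGCA", "ACGTTCGTTGCA"
print(shared_kmer_fraction(a, b, k=4))
print(longest_match_lengths(a, b))
```

For highly similar sequences the match lengths are long and informative, which is in line with the observation above that match-length methods do well on closely related data, whereas a fixed k fixes the resolution of the word-count view in advance.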

Recent studies demonstrate the potential of AF approaches in the accurate inference of phylogenetic relationships15,17, with several methods more robust than MSA-based approaches against gene-level evolutionary scenarios including among-site rate heterogeneity, compositional biases, genetic recombination and insertions/deletions. The AF methods assessed were more sensitive to recent sequence divergence and sequence truncation17. Across diverse empirical gene-family datasets, AF approaches perform well for sequences sharing low divergence, at greater computation speed. However, the sensitivity of AF approaches to evolutionary scenarios at whole-genome level including genome rearrangements and LGT, and their scalability to multi-genome data, remain to be systematically investigated.

Classical phylogenetic methods employ heuristics, and iterative sampling strategies have been adopted to assess node support, among which the best-known are the non-parametric bootstrap18 in ML and the posterior probability of tree bipartitions in Bayesian inference19,20. In comparison, AF approaches provide exact solutions with no iterative processes, with the consequence that phylogenetic trees generated using AF approaches lack estimates of node support. Bootstrap and subsampling techniques have been proposed in recent studies21-23, but most studies of AF approaches focused only on topologies, with little or no emphasis on node support.

Here, using both simulated and empirical data we systematically assess the sensitivity of nine existing AF methods to genome-scale evolutionary scenarios involving sequence divergence, LGT and rearrangement. We introduce a new application of the jackknife24 technique to provide node-support values to trees inferred by AF approaches, and demonstrate the scalability and potential of AF approaches in inferring phylogenetic trees quickly and accurately from genome-scale data.

4.1.3 Results

We investigated nine AF phylogenetic methods: D2S25, co-phylog26, cvt27, ffp28-30 and spaced31,32 based on word count, and acs33, gram34, kmacs31,35 and kr36 based on match length (see Supplementary Information). The distance matrix generated by each of these methods was used to infer phylogenetic relationships using neighbour-joining. As a measure of accuracy we used the Robinson-Foulds distance37, which evaluates the topological congruence between an inferred tree and a reference tree. We designate RF as the normalised Robinson-Foulds distance (see Methods): RF = 0 indicates that the test-tree topology is completely congruent with that of the reference, while RF = 1 indicates that the two trees have no bipartitions in common.
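The accuracy measure can be sketched in a few lines. The example below assumes that trees are given as nested tuples of leaf names and that RF is normalised by the total number of non-trivial bipartitions in the two trees; the exact normalisation used in this study is defined in the Methods, so this choice and the helper names (`leaf_set`, `bipartitions`, `normalised_rf`) are assumptions made for illustration.

```python
def leaf_set(tree):
    """Collect the leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, all_leaves):
    """Return the non-trivial bipartitions (splits) of the tree, viewed as unrooted."""
    anchor = min(all_leaves)  # fixed leaf used to store each split in one orientation
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(below) < len(all_leaves) - 1:
            splits.add(all_leaves - below if anchor in below else below)
        return below

    leaves_below(tree)
    return splits


def normalised_rf(tree_a, tree_b):
    """Symmetric difference of bipartition sets, scaled to the range 0..1."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    splits_a, splits_b = bipartitions(tree_a, leaves), bipartitions(tree_b, leaves)
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0
```

Under this normalisation a value of 0 means that every bipartition is shared and a value of 1 means that none is, which matches the interpretation of RF given above.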

4.1.3.1 Simulated data

Using simulated data, we independently assessed the sensitivity of each AF approach to evolutionary scenarios of genome divergence, LGT and genome rearrangement. Because the underlying tree on which each dataset has been simulated is known, we used that as the reference. Here we report, for each scenario, results generated with each AF approach using the parameter setting that yields the smallest mean RF globally across all rates; these settings are optimal in this context. The parameters of interest are k (k-mer length) for cvt, D2S and ffp; K (half-context length) for co-phylog; n (number of patterns) for spaced; and mm (number of mismatches) for kmacs. We used default parameters for acs, gram and kr.

Genome divergence. We simulated sets of genomes (size N from 25 to 35) using a birth-death model across different levels of genome divergence as determined by the mutation rate m = 0.1, 0.5 and 0.9; m is the number of nucleotide substitutions per iteration. At maximum rate m = 1, almost all nucleotides are substituted per iteration38. Figure 2 shows the mean RF for each approach (50 replicates) across these values of m. Optimal settings for each approach were determined based on a wide range of parameters (see Supplementary Information and Supplementary Figure S1 for details). For all the word-count methods except spaced, RF is minimum at m = 0.5 (Figure 2a), i.e. at an intermediate level of sequence divergence. On the other hand, for the match-length methods (except kmacs) RF increases with greater sequence divergence (Figure 2b), suggesting that their accuracy in inferring the correct genome phylogeny decreases as the sequences become more dissimilar from one another.

Figure 2: Accuracy of AF methods based on m. RF distances are shown for (a) word-count methods and (b) match-length methods at m = 0.1, 0.5 and 0.9. Error bars indicate standard deviation from the mean across 50 replicates. Parameter settings shown in the panels: co-phylog K = 6, cvt k = 21, D2S k = 16, ffp k = 20 and spaced n = 60 in (a); kmacs mm = 17 in (b).

Interestingly, the pattern of RF values observed for kmacs (RF = 0.06, 0.05 and 0.08 at m = 0.1, 0.5 and 0.9) follows that of most word-count methods, while the RF values we find for spaced (RF = 0.10, 0.31 and 0.48 at m = 0.1, 0.5 and 0.9) increase with greater sequence divergence, as with most of the match-length methods.

Lateral genetic transfer. We simulated genome sets at different extents of LGT as determined by l = 0, 5, 25, 125, 250 or 500, where l is the mean number of LGT events attempted in the set at each iteration (number of iterations i = 5000) during the simulation. Thus at l = 5, the simulator attempted five LGT events per iteration across a genome set (see Methods). LGT events were simulated to occur at random, i.e. anywhere along a genome sequence and between any pair of genomes in a set. Figure 3 shows the mean RF for each AF method (50 replicates) across different values of l. Optimal settings for each method were determined by tuning a range of parameters (Supplementary Information and Supplementary Figure S2). In general, word-count methods (Figure 3a) achieved lower RF values than did match-length methods (Figure 3b) against the complication of LGT. Not surprisingly, for most AF methods the accuracy of inference falls off with increasing extent of LGT. Several methods were nonetheless able to reconstruct the reference tree with little incongruence at all but the two highest frequencies of LGT simulated, notably co-phylog, cvt, and kmacs (RF < 0.10 at l ≤ 125; Figure 3).

Figure 3: Accuracy of AF methods based on l. RF distances are shown for (a) word-count methods and (b) match-length methods at l = 0, 5, 25, 125, 250 and 500. Error bars indicate standard deviation from the mean across 50 replicates. Parameter settings shown in the panels: co-phylog K = 6, cvt k = 19, D2S k = 16, ffp k = 19 and spaced n = 60 in (a); kmacs mm = 17 in (b).

The word-count method spaced (Figure 3a) and the match-length methods gram and kr (Figure 3b) were relatively inaccurate (RF > 0.2 across l) by this measure. Although spaced and gram are robust against variation of l, the inaccuracy of kr decreased with increasing l while remaining unacceptably high. As mean RF values approaching 1.0 indicate that many of the inferred topological features are scarcely distinguishable from those of a random tree39, it appears that our simulation scenario was too extreme for kr.

We also varied the maximum evolutionary distance within which LGT can occur. At fixed l = 5 we simulated LGT while setting the divergence factor d = 200, 1000, 3000 or 5000, where d is the maximum number of iterations (generations) that can separate two genomes from their common ancestor for a proposed LGT event to be accepted. This simulates biological situations in which genetic material can be successfully transferred only among related organisms due to e.g. limited plasmid host range, or a threshold of local sequence identity below which homologous recombination is not successful. Figure 4 shows the mean RF for each AF method (50 replicates) across d at otherwise optimal parameter settings (Supplementary Information and Supplementary Figure S3). For both word-count (Figure 4a) and match-length (Figure 4b) methods, RF varies little with d. Thus the effectiveness of these AF methods to reconstruct the reference tree in the presence of LGT is little affected by relatedness of the sequence donor.

Figure 4: Accuracy of AF methods based on d. RF distances are shown for (a) word-count methods and (b) match-length methods at d = 200, 1000, 3000 and 5000. Error bars indicate standard deviation from the mean across 50 replicates. Parameter settings shown in the panels: co-phylog K = 6, cvt k = 22, D2S k = 16, ffp k = 19 and spaced n = 60 in (a); kmacs mm = 17 in (b).


Genome rearrangement. We simulated genome sets as above (see Methods for details) at different frequencies of genome rearrangement r = 0.00, 0.01, 0.10 or 1.00 (each in 50 replicates). A rearrangement event was simulated as an inverted translocation occurring at any position of a genome. The latter three frequencies correspond to rearrangement of approximately 0.20%, 2.0% and 20% of each genome sequence (by length). As above, we report results generated using the parameter settings optimal for each AF method (see Supplementary Information and Supplementary Figures S4 and S5). Figure 5 shows the mean RF for each method across these values of r, at optimal parameters. Three of these AF methods, co-phylog, D2S and kr, were affected by these simulated rearrangements much less (RF < 0.03 across r) than were the others (RF > 0.05). No trend with r was apparent. These results extend an earlier study17 in which AF methods were likewise found to be robust against genetic recombination at the level of individual genes.

Figure 5: Accuracy of AF methods based on r. RF distances are shown for (a) word-count methods and (b) match-length methods at r = 0.00, 0.01, 0.10 and 1.00. Error bars indicate standard deviation from the mean across 50 replicates. Parameter settings shown in the panels: co-phylog K = 7, cvt k = 16, D2S k = 16, ffp k = 16 and spaced n = 60 in (a); kmacs mm = 17 in (b).

4.1.3.2 Empirical data

To assess the performance of the AF methods in application to empirical data, we used three sets of microbial genome sequences: (a) 143 bacterial and archaeal genomes8 previously used to infer the extent of LGT across divergent taxa, (b) 27 genomes of Escherichia coli and Shigella3 used to infer LGT among more-closely related taxa, and (c) eight Yersinia genomes that are too similar in sequence for classical phylogenetic inference, but share patterns of genome rearrangement5 (see Methods and Supplementary Information for details). As references we used the MRP supertree40 summarising the well-resolved subtrees inferred by a standard MSA-based approach from single-copy protein sets for the former two studies3,8, and the consensus phylogenetic network based on genomic inversion events inferred from a whole-genome alignment for the Yersinia dataset5. For each AF tree we computed the jackknife (JK) support for each node across 100 pseudo-replicates (see Methods).

143 bacterial and archaeal genomes. Figure 6 shows the 143-taxon tree inferred using an AF approach (D2S at k = 24), with the JK support for each node. Among the trees we inferred using other AF methods and parameter settings (see Supplementary Information and Supplementary Figure S6), this tree shows the greatest topological similarity (RF = 0.42) to the reference supertree8. Although 42% of bipartitions are incongruent with the reference, we recover 13 of the 15 phylum-level “backbone” nodes identified by Beiko et al.8 (Figure 6), some with very strong JK support (e.g. Bacteroidetes 100%, Chlamydiales 99%, high G+C 98%). An inclusive grouping of proteobacteria is poorly supported (18%), but monophyletic epsilon- (99%) and alpha-proteobacterial (71%) clades are recovered, while the beta- and gamma-Proteobacteria form a single modestly supported (62%) clade within which many gamma-Proteobacteria constitute a solid (100%) sub-clade. By contrast, unconvincing JK values are recovered for several nodes along the subtree that joins the beta- and remaining gamma-Proteobacteria. There are two major discrepancies between our D2S tree and the MRP reference: the Spirochaetales no longer form a monophyletic clade (Treponema pallidum is attracted to the problematic8 Aquifex and Thermotoga genomes), while our Crenarchaeota become non-exclusive (30%) with the inclusion of Nanoarchaeum. Overall, basal nodes exhibit lower JK support than do the more-terminal and leaf nodes, as is often the case for MSA-based trees of similar phyletic depth. As has been seen before41, most of the topological differences between the AF tree (Figure 6) and the reference supertree arise from how bipartitions are resolved on short branches within major groups, not from disagreement about the membership of these groups.

Figure 6: AF phylogeny of 143 prokaryote genomes. Phylogenetic tree of 143 prokaryote genomes using D2S at k = 24, supported by JK values.

27 Escherichia coli and Shigella genomes. Figure 7 shows the phylogenetic tree of 27 E. coli and Shigella genomes generated using co-phylog (K = 8; Figure 7a) or D2S (k = 26; Figure 7b), as well as the reference MRP supertree that summarises 5282 single-copy protein trees3 (Figure 7c). These 27 taxa have been assigned to six distinct groups (the E. coli reference, or ECOR, strains3,42) based on allelic diversity of 11 genes43. Each ECOR group is monophyletic, except B2 in both AF trees and A in the D2S tree; the relationships among the ECOR groups in the co-phylog and MRP trees are identical. The taxa labeled with an asterisk in each AF tree (Figure 7a and 7b) are positioned differently in comparison to the reference. In comparison to the reference (Figure 7c), the same relationship among the six phyletic groups is recovered in the tree generated using co-phylog (Figure 7a) at strong JK support (≥ 84% in all but two nodes), with minor topological difference within Groups B1 and D (RF = 0.083): E. coli 55989 (instead of E24377A) is placed as the basal lineage in Group B1, while E. coli UTI89 (instead of E. coli S88) is placed as sister to E. coli APEC O1 in Group D. On the other hand, the tree generated using D2S (Figure 7b; all nodes JK ≥ 90%) is less congruent with the reference (RF = 0.45); all Shigella isolates (noted as Group S) form a strongly supported clade (JK = 100%) with E. coli ATCC 8739, while in the reference tree Shigella dysenteriae is placed externally to the other Shigella isolates, as sister to Group E, the pathogenic E. coli O157:H7 isolates. As shown in Figure 7b, groups A and B1 are sister groups, supporting previous studies42,44, and group E is placed within a strongly supported clade (JK 100%) together with isolates of Group B2 and D, and not with Groups A, B1 and S as shown in the reference (Figure 7c).

Eight Yersinia genomes. Figure 8 shows the phylogenetic trees based on D2S at k = 7 (Figure 8a) and k = 9 (Figure 8b). These topologies are well-supported, with all nodes showing JK ≥ 74% and ≥ 97% respectively. Topologically the k = 7 tree is more congruent with the reference, which was based on inversion events5, than is the k = 9 tree. The two Y. pseudotuberculosis isolates are sisters (81%) in the k = 7 tree, as in the reference5. At k = 9, on the other hand, Y. pseudotuberculosis IP31758 groups solidly (100%) with Y. pestis KIM relative to the others. In the whole-genome alignment (generated using progressiveMauve in the previous study5; Supplementary Figure S7) the region of the Y. pseudotuberculosis IP31758 genome between about 1.3-3.3 Mbp shares a more-similar configuration of locally collinear blocks5 with the genome of Y. pestis KIM than with the genome of Y. pseudotuberculosis IP32953 (e.g. the corresponding region is largely in reverse complement). The two Y. pseudotuberculosis genomes are the most dissimilar in this set, sharing only 40.5% of all 12-mers present in these two genomes, whereas Y. pseudotuberculosis IP32953 and Y. pestis pestoides F 15-70 are the most similar (72.4% of shared 12-mers; Supplementary Table S1). Thus depending on the value of k, D2S can draw out phylogenetic signal that supports either the species classification (Figure 8a) or the major rearrangement events apparent in the whole-genome alignment (Figure 8b).
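The percentage of shared 12-mers quoted above can be read as a set comparison between two genomes. The sketch below shows one plausible way to compute such a figure, as the number of distinct k-mers found in both sequences divided by the number found in either. Whether the values in Supplementary Table S1 were computed exactly this way (for example, with or without reverse-complement k-mers) is not stated in this section, so the definition and the function names are assumptions.

```python
def kmer_set(seq: str, k: int) -> set:
    """Return the set of distinct k-mers present in seq (presence only, no counts)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(seq_a: str, seq_b: str, k: int) -> float:
    """Fraction of all distinct k-mers in the two sequences that occur in both.

    Assumed definition: size of the intersection divided by size of the union.
    Only the given strand is scanned; reverse complements are not considered.
    """
    kmers_a, kmers_b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0
```

For two genome strings `g1` and `g2`, `100 * shared_kmer_fraction(g1, g2, 12)` would give a percentage comparable in spirit to the figures reported above.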

Figure 7: Phylogenetic trees of 27 E. coli and Shigella species. (a) Tree generated using co-phylog at K = 8, supported by JK values. (b) Tree generated using D2S at k = 26, supported by JK values. (c) MRP tree constructed from 5282 Bayesian nucleotide trees. Taxa labeled with an asterisk in each AF tree (a, b) are positioned differently in comparison to the reference (c). ECOR groups and Shigella (S) are indicated.

Figure 8: Phylogenetic trees of 8 Yersinia genomes. (a) Tree generated using D2S at k = 7, supported by JK values. (b) Tree generated using D2S at k = 9, supported by JK values.

4.1.3.3 Computational efficiency and scalability

Table 1 shows the computation time (t) taken by each AF method to generate pairwise distance matrices for each of the three empirical datasets described above, using a single CPU and only one thread for the multi-threaded methods (e.g. D2S and spaced). For each method, t increases with N. The approach using ffp is the fastest, requiring t = 0.11, 0.65 and 2.22 minutes at N = 8, 27 and 143, while the spaced and acs methods are the most computationally intensive among these nine methods. The requirement for memory varies over two orders of magnitude, from less than 500 MB for kmacs to more than 60 GB for acs and D2S at N = 143.

Table 1: Computation time required by the nine AF methods. The computation time, t (in minutes) for all the methods across three empirical datasets (N = 8, 27 and 143). Mean genome lengths (with standard deviation) are 4.634±0.080, 4.906±0.294 and 3.011±1.802 Mb for these datasets respectively. We set k = 16 for D2S, cvt and ffp, K = 7 for co-phylog, n = 60 for spaced and mm = 12 for kmacs.

AF method      N = 8     N = 27      N = 143
co-phylog       0.37       3.47        82.80
D2S             2.21       9.06        20.35
cvt             4.80      37.15       179.76
ffp             0.11       0.65         2.22
spaced         12.67     154.42     12713.54
acs            18.67     110.75      1197.71
gram           55.34     212.33       723.87
kmacs          13.82     261.55       594.44
kr              0.93       5.16        28.34

4.1.4 Discussion

In this study we demonstrate that AF phylogenetic approaches can be used to quickly and accurately infer phylogenomic relationships of microbes using whole-genome data. We also introduce for the first time a method, based on the jackknife, to provide node-support values in phylogenetic trees constructed using AF approaches.

We examined two types of AF methods in this study. In general, the methods based on word count, particularly co-phylog and D2S, outperformed match-length methods. All these methods performed well on highly similar sequence data, but the methods based on word count (except spaced) are more robust as the input data become more divergent. All these AF methods proved robust against LGT at frequencies below l = 250 (at which every gene is likely to have been affected by LGT at least once38), antiquity of LGT and genome-scale rearrangement, extending our earlier results based on analysis of gene-scale sequence data17. Parameter values in this study were optimised for these data, based on our results and (for other AF methods) the original authors’ recommendations. These parameters are most sensitive to sequence length and genome divergence. AF methods have the advantage of being computationally fast, so we recommend that users compare a range of parameters on this basis.

The jackknife approach was simple to apply without prior sequence alignment. Here we deleted 40% of the data because 40% was previously shown to provide a reasonable balance between the generation of useful replicates without totally losing phylogenetic signal in sequence data45,46. The resulting JK support values appear to be biologically meaningful, as recognised taxa were often strongly supported (see also Supplementary Information and Supplementary Table S2). Particularly in the 143-genome dataset, JK support tended to decrease with increasing sequence divergence within-group, as is also seen with bootstrap support and Bayesian posterior probability in MSA-based studies. Problematic clades tend to be more weakly supported; for example, the beta- and gamma-Proteobacteria (JK 62% in Figure 6) were the most-frequent LGT partners in the MSA-based study of this dataset8. Similarly, the appearance of Pseudomonas genomes within a clade of beta-Proteobacteria (JK 62%) can be explained by more than 150 LGT events8. Our AF tree also supports the hypothesis that Archaea is monophyletic (41%) and distinct from Bacteria47 (JK support 90% excluding the single representative (Rhodopirellula) in this dataset of the problematic Planctomycetes48).
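The jackknife procedure can be sketched as follows, using the 40% deletion fraction stated above and the 100 pseudo-replicates reported in the Results. How the data were partitioned before deletion is described in the Methods rather than here, so the fixed-size blocks, the default `block_size`, and both function names are assumptions made purely for illustration.

```python
import random


def jackknife_replicate(seq, block_size=100, delete_fraction=0.4, rng=None):
    """Return one pseudo-replicate: the blocks of seq that survive random deletion.

    The sequence is cut into consecutive blocks (hypothetical scheme) and a
    fraction of them is deleted at random. Survivors are returned as a list,
    in their original order, so that no k-mer spans a deletion boundary.
    """
    rng = rng or random.Random()
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    n_keep = len(blocks) - round(delete_fraction * len(blocks))
    kept = sorted(rng.sample(range(len(blocks)), n_keep))
    return [blocks[i] for i in kept]


def node_support(reference_splits, replicate_split_sets):
    """Percentage of pseudo-replicate trees that contain each reference bipartition."""
    n = len(replicate_split_sets)
    return {
        split: 100.0 * sum(split in splits for splits in replicate_split_sets) / n
        for split in reference_splits
    }
```

In use, each genome would be resampled, the AF distances and neighbour-joining tree recomputed for every pseudo-replicate, and the bipartitions of each replicate tree passed to `node_support` together with those of the tree inferred from the full data.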

Likewise, AF trees based on the 27-genome E. coli and Shigella dataset recover the ECOR reference groups3. We observed two distinct trees: one congruent with the MRP supertree, and another that supports a monophyletic Shigella and the sister-group relationship between ECOR groups A and B1 as previously expected44 but not recovered in the supertree. As described above, relationships among the Yersinia strains cannot be resolved using a standard sequence-based approach: the 16S ribosomal RNA sequence is identical for seven of these isolates, while the eighth (Y. pseudotuberculosis IP 31758) differs in only five positions. All our AF-based methods (Figure 8 and Supplementary Information) recovered only two types of topology, one of which reflects the extent of genome rearrangement among these strains. Given the intricate evolution of these taxa and the fact that we cannot recreate history, the true phylogeny in these instances remains an open question.

Our use of an MRP supertree as reference is not entirely unproblematic in this context. A supertree summarises the well-supported topological features in a set of input trees; in these two cases3,8 and many others, each input tree arises via a workflow of orthogroup identification, MSA and Bayesian phylogenetic inference. Even in microbial genomes, where protein-coding genes tend to be tightly packed, many regions are excluded from contributing to the reference topology including non-protein-coding, intergenic, rearranged, mis-assembled, unalignable or low-complexity regions, paralogs, pseudogenes and repetitive elements. By contrast, AF approaches make use of the entire genome sequence and are less dependent on pre-defined evolutionary models; assumptions implicit in these models are known to be highly simplified and may be unrealistic9.

The computational complexity of all the AF methods can be found in previous studies25-27,30,31,34,36 and ranges from O(n) for ffp, to O(k × n × z) for kmacs, with n the number of sequences, k the number of mismatches and z the average number of matches between two sequences. An advantage that AF methods have over the standard MSA-based approaches is their scalability15,17. Among the AF methods we tested, methods based on word counts can be orders of magnitude faster than those based on match lengths, but tend to be more memory-intensive and more sensitive to parameter settings. It has been proposed49 that filtering out non-informative k-mers can be a useful approach to reducing memory requirements, and one can imagine a systematic, adaptive approach to parameterisation of k based on the nature of the sequence data (e.g. extent of divergence, presence of repetitive elements) as learned from simulated sequences. Biologically motivated approaches could likewise be explored, e.g. the adaptive use or weighting of k-mer patterns common in genomes (as learned from a comprehensive genome k-mer database) and/or relevant to biological processes or molecular function.

In cases where assessment of aligned positions across conserved regions is necessary, e.g. to infer structural features of proteins, MSA remains indispensable. However, these and other results show that AF approaches present exciting alternatives in phylogenetic inference for large sets of microbial-sized genomes at different phyletic breadth, even in the presence of genomic rearrangement and LGT. Indeed it might be asked whether AF approaches might be modified to detect genomic regions of lateral origin; we intend to address this in the very near future.

4.1.5 Methods

Simulated genome data. For all programs mentioned below, default settings were used unless otherwise specified. We simulated sets of genomes using two different programs depending on their functionality: Evolsimulator38 to simulate scenarios relevant to lateral genetic transfer (LGT), and ALF50 to simulate genome rearrangement. Evolsimulator allows the user to specify rates independently for speciation, extinction, gene content, extent of gene loss and/or duplication, nucleotide substitution and LGT. Moreover, LGT events can be allowed to succeed automatically, or can be evaluated under different criteria (e.g. controlled by gene complement, divergence, G+C similarity and/or host habitat); the receptivity of each genome to LGT can also be specified. We used the Generalised Time-Reversible (GTR)51 substitution model (rate parameters a=0.987, b=0.11, c=0.218, d=0.243, e=0.395) in all simulations. We detail our simulation strategy for each evolutionary scenario below.

Sequence divergence and lateral genetic transfer. Each set of genomes was simulated under a birth-and-death model at speciation rate = extinction rate = 0.5. The number of genomes in each set was allowed to vary from 25 to 35, with each containing 2000-3000 genes of length 240-1500 nucleotides. LGT receptivity was set at minimum 0.2, mean 0.5 and maximum 0.8, mutation rate m = 0.4-0.6 and number of generations i = 5000. Sequence change was simulated under a discrete approximation of the gamma distribution (shape parameter α = 1.0, 8 categories). To study the effect of LGT we simulated LGT between randomly selected genomes, varying the mean number of LGT events attempted per iteration l = 5, 25, 125, 250 or 500 (each scenario in 50 replicates). We also simulated datasets in which genomes were not selected at random (but instead were restricted by number of generations from their common ancestor) by varying the divergence factor d: 200, 1000, 3000 or 5000 (with l fixed at 5, m 0.4-0.6); larger d allows more-dissimilar sequences to be transferred. To assess the effect of genome divergence we set l = 5 and simulated datasets at three rates of mutation m: 0.01-0.2, 0.4-0.6 and 0.8-0.99. All other parameters in this simulation follow Beiko et al.52.

Genome rearrangement. We simulated different extents of rearrangement (random events of inverted translocation) within genome sequence sets using ALF, setting the rate r = 0.00, 0.01, 0.10 or 1.00 (each in 50 replicates). These values correspond to zero and approximately 0.2%, 2.0% and 20% of the length of each genome undergoing rearrangement. Other settings were number of genomes in each set N = 30 (each with 2500 genes of length 240-2000 nt), speciation rate = 0.5, extinction rate = 0.1, and mutation rate = 0.5. We allowed up to 20 genes to be involved in each inverted translocation event, following a uniform distribution: i.e. there was a 5% chance for one gene to be involved, a 5% chance for two genes to be involved, and so on.

Empirical genome data. All genome datasets were obtained from NCBI (ncbi.nlm.nih.gov) and are based on published studies. The dataset of 143 diverse bacterial and archaeal genomes is a subset of the 144 genomes used in a previous study8 to detect highways of gene sharing; two genomes in that study, Agrobacterium tumefaciens Cereon and C58, have since been merged into a single record in the NCBI database (as Agrobacterium fabrum str. C58). The 27 Escherichia coli and Shigella genomes, and the eight Yersinia genomes, were taken respectively from Skippington and Ragan3 and Darling et al.5

Phylogenetic trees. For each genome set, we used nine AF methods to compute a phylogenetic tree. These methods fall into two groups: word-count (D2S, cvt, ffp, co-phylog, spaced) and match-length methods (gram, acs, kr, kmacs) (see Supplementary information). Six methods require a key parameter value to be set: k-mer length k for D2S, ffp and cvt; half-context length K for co-phylog; number of patterns n for spaced; and number of mismatches mm for kmacs. We tested a wide range of values for these parameters (Supplementary Information) but for reasons of space report only the results obtained using the optimal parameter value, defined as that yielding the topology most congruent to the reference tree (see text). The distance matrix generated by each AF method was input into the neighbour-joining algorithm (neighbor in PHYLIP v3.69) to generate the corresponding tree.
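As an illustration of the last step, the sketch below writes a pairwise distance matrix in the square PHYLIP layout that neighbor reads by default (a first line with the number of taxa, then one row per taxon with the name padded to ten characters). The function name format_phylip_matrix and the example genome labels are hypothetical and are not part of any AF program used here; the AF programs themselves may already emit this format.

```python
def format_phylip_matrix(names, dist):
    """Return a square distance matrix as a PHYLIP-formatted string.

    names: list of taxon labels (hypothetical example labels are used below)
    dist:  list of rows, dist[i][j] being the distance between taxa i and j
    """
    n = len(names)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be square and match names")
    lines = [f"{n:5d}"]
    for name, row in zip(names, dist):
        # The classic PHYLIP layout reserves the first 10 characters for the name.
        # Truncation can create duplicate labels, so real names should be checked.
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example_names = ["genomeA", "genomeB", "genomeC"]
    example_dist = [
        [0.0, 0.1, 0.2],
        [0.1, 0.0, 0.3],
        [0.2, 0.3, 0.0],
    ]
    print(format_phylip_matrix(example_names, example_dist), end="")
```

The resulting text can be saved as the input file for neighbor; labels longer than ten characters would need to be mapped to short unique identifiers first.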

Jackknifing technique. For the jackknifing procedure we adapted the approach used by Shi et al.46 and used a jackknife rate of 40%, as suggested by the authors and supported by our tests on different jackknife rates (see Supplementary information and Supplementary Figure S8). From each genome set we generated 100 pseudo-replicates: for each pseudo-replicate a randomly selected 100-nt fragment was deleted from each genome, and this was iterated a total of N times where N = (genome length × 0.4)/100, resulting in the deletion of (a different) 40% of each genome in each set. Then we generated a phylogenetic tree for each replicate using an AF method, and calculated node support by comparing the pseudo-replicate trees and the test tree.
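The deletion step can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function name jackknife_genome is hypothetical, and it assumes that fragments are removed one after another from the progressively shortened sequence, so that exactly N × 100 nt are deleted.

```python
import random


def jackknife_genome(seq, rate=0.4, frag_len=100, rng=None):
    """Return one jackknife pseudo-replicate of a genome sequence.

    Deletes N = (len(seq) * rate) / frag_len randomly placed fragments of
    frag_len nucleotides. Assumption: each fragment is removed from the
    already-shortened sequence, so deletions never overlap.
    """
    rng = rng or random.Random()
    n_deletions = int(len(seq) * rate / frag_len)
    for _ in range(n_deletions):
        if len(seq) <= frag_len:
            break  # nothing sensible left to delete
        start = rng.randrange(len(seq) - frag_len + 1)
        seq = seq[:start] + seq[start + frag_len:]
    return seq


if __name__ == "__main__":
    genome = "ACGT" * 2500  # toy 10,000-nt sequence
    replicate = jackknife_genome(genome, rng=random.Random(1))
    print(len(genome), len(replicate))
```

For a 5-Mbp genome this corresponds to N = 20,000 deletions. Repeated string slicing is slow at that scale, so a practical implementation would more likely sample the deleted intervals first and write out the retained regions in a single pass.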

Assessment of accuracy. To assess the accuracy of each AF method we computed the Robinson-Foulds distance between a tree computed using that method (the “test tree”) and the corresponding reference tree, using treedist in PHYLIP v3.69. This distance represents the number of bipartitions that are present in only one of the two trees. To facilitate comparison of our results across sequence sets (hence trees) of different sizes N, following Kupczok et al.53 we normalise this distance according to the maximum possible distance between two unrooted trees, 2(N - 3). We denote this normalised Robinson-Foulds distance as RF, with a value from 0 to 1 that can be interpreted as the proportion of false or missing bipartitions in the test-tree topology compared to the reference topology. When RF = 0 the test and reference topologies are identical, implying high accuracy for the method. Conversely, at RF = 1 no bipartition in the reference is recovered. A pair of randomly generated tree topologies of N taxa has a Robinson-Foulds distance that approaches the denominator for normalisation, 2(N - 3)54, i.e. RF approaches 1. For the simulated data we used as reference the known tree (according to which the sequences were simulated) provided by Evolsimulator or ALF. For the E. coli - Shigella and 143-genome empirical datasets we used as reference the MRP supertrees from Skippington and Ragan3 and Beiko et al.8 respectively; in these cases RF does not inform directly on accuracy (as the true tree is unknown), but instead reflects the extent to which the AF method recovers the published topology. For the Yersinia dataset we used the consensus phylogenetic network5 as reference.
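The normalisation can be illustrated with a short sketch. It is not the treedist implementation: trees are represented here as nested tuples of taxon names (an assumption made for brevity), and the function names bipartitions and normalised_rf are hypothetical.

```python
def bipartitions(tree):
    """Return (taxa, set of non-trivial bipartitions) for a nested-tuple tree.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the result does not depend on where the tree is rooted.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    taxa = leaves(tree)
    ref = min(taxa)  # arbitrary but fixed reference taxon
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            splits.add(side)
        return clade

    visit(tree)
    return taxa, splits


def normalised_rf(test_tree, reference_tree):
    """Robinson-Foulds distance divided by its maximum, 2(N - 3)."""
    taxa_a, splits_a = bipartitions(test_tree)
    taxa_b, splits_b = bipartitions(reference_tree)
    if taxa_a != taxa_b:
        raise ValueError("trees must contain the same taxa")
    n = len(taxa_a)
    if n < 4:
        raise ValueError("at least four taxa are needed")
    return len(splits_a ^ splits_b) / (2 * (n - 3))


if __name__ == "__main__":
    reference = (("A", "B"), ("C", ("D", "E")))
    test = (("A", "C"), ("B", ("D", "E")))
    print(normalised_rf(test, reference))
```

In this five-taxon example each tree has two non-trivial bipartitions and the trees share one of them, so two bipartitions are unique to one tree and RF = 2 / (2 × 2) = 0.5.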

Computational scalability and runtime. Assessment of computational scalability was carried out using a high-performance distributed-memory computing cluster based on Intel Xeon 'Haswell' Cores 3.1GHz. Comparative runtime analysis of alignment-free methods was done on Intel Xeon 'Haswell' Cores (E5-2667 v3) @ 3.1GHz / boost 3.5 GHz (using a single processor and one thread).

4.1.6 References

1 Tong, S. Y. et al. Genome sequencing defines phylogeny and spread of methicillin-resistant Staphylococcus aureus in a high transmission setting. Genome Res 25, 111-118, doi:10.1101/gr.174730.114 (2015).

2 Dunn, C. W. et al. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452, 745-749, doi:10.1038/nature06614 (2008).

3 Skippington, E. & Ragan, M. A. Within-species lateral genetic transfer and the evolution of transcriptional regulation in Escherichia coli and Shigella. BMC Genomics 12, 532, doi:10.1186/1471-2164-12-532 (2011).

4 Jarvis, E. D. et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346, 1320-1331, doi:10.1126/science.1253451 (2014).

5 Darling, A. E., Miklós, I. & Ragan, M. A. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4, e1000128, doi:10.1371/journal.pgen.1000128 (2008).

6 Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol Direct 8, 3, doi:10.1186/1745-6150-8-3 (2013).

7 Puigbò, P., Wolf, Y. I. & Koonin, E. V. The tree and net components of prokaryote evolution. Genome Biol Evol 2, 745-756, doi:10.1093/gbe/evq062 (2010).

8 Beiko, R. G., Harlow, T. J. & Ragan, M. A. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102, 14332-14337, doi:10.1073/pnas.0504068102 (2005).

9 Stiller, J. W. Experimental design and statistical rigor in phylogenomics of horizontal and endosymbiotic gene transfer. BMC Evol Biol 11, 259, doi:10.1186/1471-2148-11-259 (2011).

10 Wong, K. M., Suchard, M. A. & Huelsenbeck, J. P. Alignment uncertainty and genomic analysis. Science 319, 473-476, doi:10.1126/science.1151532 (2008).

11 Shih, P. M. et al. Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. Proc Natl Acad Sci U S A 110, 1053-1058, doi:10.1073/pnas.1217107110 (2013).

12 Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056-1060, doi:10.1038/nature08656 (2009).

13 Eyre, D. W. et al. A pilot study of rapid benchtop sequencing of Staphylococcus aureus and Clostridium difficile for outbreak detection and surveillance. BMJ Open 2:e001124, doi:10.1136/bmjopen-2012-001124 (2012).

14 Bonham-Carter, O., Steele, J. & Bastola, D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 15, 890-905, doi:10.1093/bib/bbt052 (2014).

15 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

16 Vinga, S. & Almeida, J. Alignment-free sequence comparison-a review. Bioinformatics 19, 513-523 (2003).

17 Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4, 6504, doi:10.1038/srep06504 (2014).

18 Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368-376 (1981).

19 Huelsenbeck, J. P., Ronquist, F., Nielsen, R. & Bollback, J. P. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310-2314, doi:10.1126/science.1065889 (2001).

20 Huelsenbeck, J. P. & Ronquist, F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754-755 (2001).

21 Fan, H., Ives, A. R., Surget-Groba, Y. & Cannon, C. H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522, doi:10.1186/s12864-015-1647-5 (2015).

22 Ren, J. et al. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics 32, 993-1000, doi:10.1093/bioinformatics/btv395 (2016).

23 Song, K. et al. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 20, 64-79, doi:10.1089/cmb.2012.0228 (2013).

24 Miller, R. G. Jackknife - Review. Biometrika 61, 1-15 (1974).

25 Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467-1490, doi:10.1089/cmb.2010.0056 (2010).

26 Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 41, e75, doi:10.1093/nar/gkt003 (2013).

27 Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, doi:10.1186/1471-2148-9-195 (2009).

28 Jun, S. R., Sims, G. E., Wu, G. A. & Kim, S. H. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proc Natl Acad Sci U S A 107, 133-138, doi:10.1073/pnas.0913033107 (2010).

29 Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106, 2677-2682, doi:10.1073/pnas.0813249106 (2009).

30 Sims, G. E. & Kim, S. H. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci U S A 108, 8329-8334, doi:10.1073/pnas.1105168108 (2011).

31 Horwege, S. et al. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42, W7-11, doi:10.1093/nar/gku398 (2014).

32 Leimeister, C. A., Boden, M., Horwege, S., Lindner, S. & Morgenstern, B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30, 1991-1999, doi:10.1093/bioinformatics/btu177 (2014).

33 Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336-350, doi:10.1089/cmb.2006.13.336 (2006).

34 Russell, D. J., Way, S. F., Benson, A. K. & Sayood, K. A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences. BMC Bioinformatics 11, 601, doi:10.1186/1471-2105-11-601 (2010).

35 Leimeister, C. A. & Morgenstern, B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30, 2000-2008, doi:10.1093/bioinformatics/btu331 (2014).

36 Haubold, B., Pfaffelhuber, P., Domazet-Lošo, M. & Wiehe, T. Estimating mutation distances from unaligned genomes. J Comput Biol 16, 1487-1500, doi:10.1089/cmb.2009.0106 (2009).

37 Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math Biosci 53, 131-147, doi:10.1016/0025-5564(81)90043-2 (1981).

38 Beiko, R. G. & Charlebois, R. L. A simulation test bed for hypotheses of genome evolution. Bioinformatics 23, 825-831, doi:10.1093/bioinformatics/btm024 (2007).

39 Bryant, D. & Steel, M. Computing the distribution of a tree metric. IEEE/ACM Trans Comput Biol Bioinform 6, 420-426, doi:10.1109/TCBB.2009.32 (2009).

40 Ragan, M. A. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol 1, 53-58, doi:10.1016/1055-7903(92)90035-F (1992).

41 Ragan, M. A., Bernard, G. & Chan, C. X. Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol 11, 176-185, doi:10.4161/rna.27505 (2014).

42 Gordon, D. M., Clermont, O., Tolley, H. & Denamur, E. Assigning Escherichia coli strains to phylogenetic groups: multi-locus sequence typing versus the PCR triplex method. Environ Microbiol 10, 2484-2496, doi:10.1111/j.1462-2920.2008.01669.x (2008).

43 Ochman, H. & Selander, R. K. Standard reference strains of Escherichia coli from natural populations. J Bacteriol 157, 690-693, doi:0021-9193/84/020690-04$02.00/0 (1984).

44 Lecointre, G., Rachdi, L., Darlu, P. & Denamur, E. Escherichia coli molecular phylogeny using the incongruence length difference test. Mol Biol Evol 15, 1685-1695 (1998).

45 Farris, J. S., Albert, V. A., Källersjö, M., Lipscomb, D. & Kluge, A. G. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12, 99-124 (1996).

46 Shi, J., Zhang, Y., Luo, H. & Tang, J. Using jackknife to assess the quality of gene order phylogenies. BMC Bioinformatics 11, 168, doi:10.1186/1471-2105-11-168 (2010).

47 Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74, 5088-5090, doi:10.1073/pnas.74.11.5088 (1977).

48 Fuerst, J. A. & Sagulenko, E. Beyond the bacterium: planctomycetes challenge our concepts of microbial structure and function. Nat Rev Microbiol 9, 403-413, doi:10.1038/nrmicro2578 (2011).

49 Gunasinghe, U., Alahakoon, D. & Bedingfield, S. Extraction of high quality k-words for alignment-free sequence comparison. J Theor Biol 358, 31-51, doi:10.1016/j.jtbi.2014.05.016 (2014).

50 Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF--a simulation framework for genome evolution. Mol Biol Evol 29, 1115-1123, doi:10.1093/molbev/msr268 (2012).

51 Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 17, 57-86 (1986).

52 Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution on genome phylogeny. Syst Biol 57, 844-856, doi:10.1080/10635150802559265 (2008).

53 Kupczok, A., Schmidt, H. A. & von Haeseler, A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol Biol 5, 37, doi:10.1186/1748-7188-5-37 (2010).

54 Bryant, D. & Steel, M. Computing the distribution of a tree metric. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 420-426, doi:10.1109/TCBB.2009.32 (2009).

4.2 Concluding remarks

This chapter demonstrates that AF approaches can be used to infer accurate phylogenies of hundreds of microbial genomes, based on sets of whole-genome sequences. Here I found that most AF methods are sensitive to large amounts of LGT but robust against genome rearrangement, and I identified optimal parameters for the methods that require parameter selection. I found that AF methods are highly scalable, although their scalability varies between the two families (i.e. word-count and match-length). Finally, I introduced a new application of the jackknife technique to provide node support values to subtrees inferred by AF approaches, and showed that these values are biologically meaningful. However, the discrepancies between the phylogenetic trees inferred by MSA-based and AF methods in the empirical datasets highlight the complexity of microbial evolution, in which common evolutionary phenomena such as LGT are not captured adequately in a strictly bifurcating tree-like structure of phylogenetic relationships. A more-flexible structure that incorporates these events, e.g. a network, would represent a biologically more-realistic phylogeny of microbial genomes.

CHAPTER 5: ALIGNMENT-FREE NETWORKS: THE NEXT MICROBIAL PHYLOGENOMICS

In Chapter 4, I demonstrated that AF approaches can infer accurate phylogenetic trees of complete microbial genomes in a scalable manner. However, microbial evolution does not follow a tree-like structure, notably because of widely spread LGT events, and a network structure is more appropriate to realistically represent the evolution of microbial genomes.

This chapter is presented in the form of two manuscripts. In this chapter I address the last two Aims of my project (Chapter 1, section 1.5). The first manuscript presents proof-of-concept that AF approaches can generate phylogenetic networks to infer phylogenetic relationships among microbial genomes. This work was published in the journal Scientific Reports (doi: 10.1038/srep28970) and edited for this thesis. As the first author of this paper, I designed the framework, prepared the manuscript and conducted all the analyses.

The second manuscript presents an extended AF approach that can be used to infer phylogenetic networks quickly and accurately for large-scale microbial whole-genome data. This work was published in bioRxiv (doi: 10.1101/125237) and edited for this thesis. As the first author of this paper, I designed the framework, conducted all the analyses and prepared the manuscript. The supplementary material for this manuscript is presented in Appendix C.

5.1 Recapitulating phylogenies using k-mers: from trees to networks

5.1.1 Abstract

Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and genomic processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using genome sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences at fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.

5.1.2 Introduction

Ernst Haeckel coined the term Phylogenie to describe the series of morphological stages in the evolutionary history of an organism or group of organisms1. In his Tree of Life published 150 years ago2, Haeckel postulated that living organisms trace their evolutionary origin(s) along three distinct lineages (Plantae, Protista and Animalia) to a “common Moneran root of autogonous organisms”. In some (but not all) later works (e.g. in 18683) he allowed that different Monera may have arisen independently by spontaneous generation. Either way, these views accord with the Lamarckian notion of a built-in direction of evolution from morphologically simple “lower” organisms to more-complex “higher” forms4.

Haeckel through his “Biogenetic Law” advocated that “ontogeny recapitulates phylogeny”2: that the embryonic series of an organism is a record of its evolutionary history. Under this view, morphologies observed at different developmental stages of an organism resemble and represent the successive stages (including adult stages) of its ancestors over the course of evolution. Of course, he worked before the advent of genetics and the modern synthesis, and before it was appreciated that information on heredity is carried by DNA and can be recovered by sequencing and statistical analysis. He could not have foreseen that these DNA sequences code for other biomolecules and control life processes, including his beloved developmental series and organismal phenotype, through vastly complex molecular webs of interactions. Nor could Haeckel have envisaged the scale of phylogenetic analysis that can be carried out today using these DNA sequences across multiple genomes, made possible by the advent of high-throughput sequencing and computing technologies.

Fast-forwarding 150 years, phylogenetic inference based on comparative analysis of biological sequences is now a common practice. The similarity among sequences is commonly interpreted as evidence of homology5,6, i.e. that they share a common ancestry. From the earliest days of molecular phylogenetics, multiple sequences have been aligned7,8 to display this homology position-by-position along the length of the sequences. That is, the residues are arranged relative to each other such that the best available hypothesis of homology is achieved at every position (column) of the alignment. By default, it is assumed that the best alignment can be achieved simply by displaying the sequences in the same direction, and inserting gaps where needed (to represent insertions and deletions). This assumption is largely valid when working with highly conserved orthologs of any source, and with exons or proteins of morphologically complex eukaryotes. However, microbial genomes are often affected by recombination and rearrangement9, undermining the assumption of homology along adjacent positions, while lateral genetic transfer would not be represented by a common treelike process10-13. As Haeckel observed when he drew his Tree2, biological evolution can be anything but straightforward, and these complications have become ever more-complicated14,15.

Alternative approaches for inferring and representing phylogenies are available. An attractive strategy that addresses the issue of full-length alignability is to compute relatedness among a set of sequences based on the number or extent of k-mers (short sub-sequences of a fixed length k) that they share. Such approaches avoid multiple sequence alignment, and for this reason are termed alignment-free. As opposed to heuristics in multiple sequence alignment, these methods provide exact solutions. Various modifications are available, e.g. the use of degenerate k-mers, scoring match lengths rather than k-mer composition, and grammar-based techniques; see recent reviews16,17 for more detail. Methods for inferring lateral genetic transfer have also been developed18,19. Importantly, evolutionary relationships can also be depicted as a network, with taxa and relationships represented respectively as nodes and edges20-24, rather than as a strictly bifurcating tree. Using simulated and empirical sequence data, we recently demonstrated that alignment-free approaches can yield phylogenetic trees that are biologically meaningful25-27. We find that these approaches are more robust to genome rearrangement and lateral genetic transfer, and are highly scalable25,26, a much-desired feature given the current deluge of sequence data facing the research community28. Here we extend the alignment-free phylogenetic approaches on 143 bacterial and archaeal genomes to generate a network of phylogenetic relatedness, and assess biological implications of this network relative to the phylogenetic tree. The phylogenetic relationships among these genomes have been carefully studied using the standard approach based on multiple sequence alignment10 and an alignment-free approach25; this dataset thus provides a good reference for comparison.

5.1.3 Methods

Using 143 complete genomes of Bacteria and Archaea25, we inferred the relatedness of these genome sequences using an alignment-free method based on the D2S statistic29,30. We computed a D2S distance, d, for each possible pair of 143 genomes based on the presence of shared 25-mers using jD2Stat version 1.0 (http://bioinformatics.org.au/tools/jD2Stat/)26 and following Bernard et al.25. Here the distance d is normalised based on genome sizes and the probabilities that corresponding k-mers occur in the compared sequences29,30; d ranges between 0 (i.e. two genomes are identical) and 12 (< 0.001% of 25-mers are shared between the two genomes). For a pair of genomes a and b, we transformed dab into a similarity measure Sab, in which Sab = 10 - dab. We ignore instances of d > 10, as these pairs of sequences share < 0.01% of 25-mers (i.e. there is little evidence of homology). To visualise the phylogenetic relatedness of these genomes, we adopted the D3 JavaScript library for data-driven documents (https://d3js.org/). In this network, each node represents a genome, and an edge connecting two nodes represents the qualitative evidence of shared k-mers between them. We set a threshold t such that only edges with S ≥ t are displayed on the screen. Changing t dynamically changes the network structure. The resulting dynamic network is available at http://bioinformatics.org.au/tools/AFnetwork/.
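The transformation and thresholding described above can be sketched in a few lines. This is an illustration only: the function name similarity_edges, the dictionary-based input and the genome labels are assumptions made for this example, and the distances would in practice come from jD2Stat.

```python
def similarity_edges(distances, threshold, max_distance=10.0):
    """Return the edges (a, b, S_ab) displayed at a given threshold t.

    distances: dict mapping a genome pair (a, b) to its distance d_ab
    threshold: only edges with S_ab >= threshold are kept
    Pairs with d_ab > max_distance are ignored, as described in the text.
    """
    edges = []
    for (a, b), d in distances.items():
        if d > max_distance:
            continue  # too few shared k-mers to count as evidence
        s = max_distance - d  # S_ab = 10 - d_ab
        if s >= threshold:
            edges.append((a, b, s))
    return edges


if __name__ == "__main__":
    # Hypothetical distances between three genomes
    example = {("g1", "g2"): 1.5, ("g1", "g3"): 9.0, ("g2", "g3"): 11.0}
    for t in (0.0, 2.0):
        print(t, similarity_edges(example, t))
```

Raising t from 0 to 2 removes the weak g1-g3 edge while the g1-g2 edge remains, which is the behaviour that makes the displayed network change as t is adjusted.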

5.1.4 Results and discussion

Figure 1 shows the phylogenetic tree of the 143 Bacteria and Archaea genomes that we previously inferred using an alignment-free method based on the D2S statistic29,30. In an earlier study10, a supertree was generated for these genomes, summarising 22,432 protein phylogenies. Incongruence between the two trees was observed in 42% of the bipartitions, most of which are at terminal branches25. The alignment-free tree (Figure 1) recovers 13 out of the 15 “backbone” nodes10, distinct clades of Archaea and Bacteria, a monophyletic clade of Proteobacteria, and the lack of resolution between gamma- and beta-Proteobacteria, in agreement with previously published studies; as such, this tree captures most of the major biological groupings of Bacteria and Archaea as presently understood.

[Figure 1 image: circular phylogenetic tree of the 143 genomes, with taxon names at the tips and jackknife support values at internal nodes. Clade labels: Euryarchaeota, Crenarchaeota, Nanoarchaeota, α-, β-, γ- and ε-Proteobacteria, Planctomycetes, Chlamydiales, Spirochaetales, Bacteroidetes, Cyanobacteria, Deinococcus, Low G+C Firmicutes, High G+C Firmicutes, Chlorobiales, Aquificaceae, Thermotoga.]

Figure 1. The alignment-free phylogenetic tree topology of the 143 Bacteria and Archaea genomes based on the D2S statistic, modified from the tree in Bernard et al.25; jackknife support at each internal node is shown. Each phylum is represented in a distinct colour, and the backbones identified in Beiko et al.10 are shown on the internal nodes with black filled circles. The association of Coxiella burnetii and Nitrosomonas europaea is highlighted with a green background.

Figure 2 shows the network of phylogenetic relatedness of the same 143 genomes; a dynamic view of this network is available at http://bioinformatics.org.au/tools/AFnetwork/. As in our tree (Figure 1), Archaea and Bacteria form two separate paracliques; even at t = 0, we found only one archaeal isolate (the euryarchaeote Methanocaldococcus jannaschii DSM 2661) linked to the bacterial groups Thermotogales and Aquificales25. By t = 3, most of the 14 phyla have formed distinct densely connected subgraphs in our network, e.g. Cyanobacteria and Chlamydiales form cliques at t = 1.5 and all subgroups of Proteobacteria form a large paraclique with the Firmicutes at t = 2. Four Escherichia coli and two Shigella isolates, known to be closely related, form a clique up to t = 8.5. Interestingly, this network also showcases the extent to which genomic regions are shared among diverse phyla, e.g. the high extent of genetic similarity among Proteobacteria versus the low extent between Chlamydophila and Cyanobacteria. Our observations largely agree with published studies10,25, but also highlight the inadequacy of representing microbial phylogeny as a tree. For instance, in the tree Coxiella burnetii, a member of the gamma-Proteobacteria, is grouped with Nitrosomonas europaea of the beta-Proteobacteria (marked with an asterisk in Figure 1); in the network, the strongest connection of C. burnetii is with Wigglesworthia glossinidia, a member of the gamma-Proteobacteria (marked with an asterisk in Figure 2) at t = 2. Both W. glossinidia and C. burnetii are parasites; the W. glossinidia genome (0.7 Mbp) is highly reduced31 and the C. burnetii genome (2 Mbp) is proposed to be undergoing reduction32. As both the tree (Figure 1) and the network presented here were generated using the same alignment-free method, the contradictory position of C. burnetii is likely caused by the neighbour-joining algorithm used for tree inference25. In this scenario, the C. burnetii genome connects with N. europaea because it shares high similarity with N. europaea and Neisseria genomes of the beta-Proteobacteria (S between 1.43 and 1.68), second only to W. glossinidia (S = 2.05), and because it shares little or no similarity with other genomes of gamma-Proteobacteria that are closely related to W. glossinidia, i.e. Buchnera aphidicola isolates (average S = 0.63) and “Candidatus Blochmannia floridanus” (S = 0).

By changing the threshold t, we can dynamically visualise changes in the network structure. These changes are not random, but appear to correlate with the evolutionary history of the species. At t = 0, Archaea and Bacteria form two distinct paracliques, linked only by two edges, and the Planctomycetes isolate forms a singleton. When we increase t from 1 to 2, the Archaea and Bacteria paracliques quickly dissociate from each other; within the Bacteria, cliques of Chlamydiales and Cyanobacteria are formed and the Spirochaetales become isolated. Going from t = 2 to t = 3 we observe a scission between Firmicutes and Proteobacteria, and at t > 3 all classes of Proteobacteria start to form their respective paracliques. The separation (as t is incremented) of a densely connected subgraph involving all representatives of a phylum from the rest of the network mimics the divergence of this phylum from a common ancestor. Because the similarity measures do not have a unit (such as number of substitutions per site), it is not straightforward to interpret S as an evolutionary rate or divergence time. A comprehensive comparative analysis between our network here and one that is generated using multiple sequence alignment is beyond the scope of this work. However, our findings suggest that our alignment-free network yields snapshots of biologically meaningful evolutionary relationships among these genomes, and that increasing the threshold based on the proportion of shared k-mers recapitulates the progressive separation of genomic lineages in evolution.
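The effect of raising the threshold can be illustrated with a minimal Python sketch. It assumes the similarity scores are already available as a dictionary keyed by genome pairs; the genome names and scores below are invented for illustration, and connected components are used as a coarser stand-in for the cliques and paracliques discussed above.

```python
from collections import defaultdict

def components(nodes, edges):
    """Return the connected components (sorted lists) of an undirected graph."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())

def sweep(similarity, nodes, thresholds):
    """For each threshold t, keep edges with S >= t and list the components."""
    result = {}
    for t in thresholds:
        kept = [pair for pair, s in similarity.items() if s >= t]
        result[t] = components(nodes, kept)
    return result

# Hypothetical genomes and similarity scores (not values from this study).
toy_nodes = ["arch1", "arch2", "bact1", "bact2", "bact3"]
toy_similarity = {
    ("arch1", "arch2"): 6.0,
    ("arch1", "bact1"): 0.5,
    ("bact1", "bact2"): 4.0,
    ("bact2", "bact3"): 2.5,
}
by_threshold = sweep(toy_similarity, toy_nodes, [0, 1, 3, 5])
```

In this toy example all five genomes are connected at t = 0, the two "archaeal" genomes detach from the "bacterial" ones at t = 1, and the bacterial group fragments further as t increases.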

Figure 2. Alignment-free phylogenetic network of the 143 Bacteria and Archaea genomes based on the D2S statistic using 25-mers, at t = 2. Each phylum is represented in a distinct colour, each node represents a genome and an edge represents qualitative evidence of shared 25-mers between two genomes. The association between Coxiella burnetii and Wigglesworthia glossinidia is marked with an asterisk.

The alignment-free network reconstructed using whole-genome sequences thus recovers phylogenetic signals that cannot be captured in a binary tree. Using this approach, we generated the network in < 30 minutes (wall time); a whole-genome alignment of 143 sequences would have taken days, and even then, the alignment would be difficult to interpret given the genome dynamics in Bacteria and Archaea9-13,33. One can imagine inferring a network of thousands of microbial genomes in a few hours using distributed computing. More importantly, the network can be visualised dynamically, explored interactively and shared.

Other biological questions could be addressed by linking the k-mers to their genomic locations and annotated genome features, e.g. in a relational database34. For instance, we could use such a database to compare thousands of isolates and identify core gene functions for a specific phylum or genus, or exclusive versus non-exclusive functions in bacterial pathogens, in a matter of seconds. We can also use k-mers to quickly search for biological information, e.g. functions relevant to lateral genetic transfer, recombination or duplications.

In contrast to Haeckel’s “Biogenetic Law”, k-mers used in this way recapitulate phylogenetic signal, not ontogeny. Alignment-free approaches generate a biologically meaningful phylogenetic inference, and are highly scalable. More importantly, representing alignment-free phylogenetic relationships using a network captures aspects of evolutionary histories that are not possible in a tree. As more genome data become available, Haeckel’s goal of depicting the History of Life is closer to reality.

5.1.5 References

1 Dayrat, B. The roots of phylogeny: how did Haeckel build his trees? Syst. Biol. 52, 515-527 (2003).

2 Haeckel, E. Generelle Morphologie der Organismen. Allgemeine Grundzüge der organischen Formen-Wissenschaft, mechanisch begründet durch die von Charles Darwin reformierte Descendenztheorie. Bd. 1 und 2., (Reimer, 1866).

3 Haeckel, E. Natürliche Schöpfungsgeschichte. (Reimer, 1868).

4 Burkhardt, R. W., Jr. Lamarck, evolution, and the inheritance of acquired characters. Genetics 194, 793-805, doi:10.1534/genetics.113.151852 (2013).

5 Fitch, W. M. Homology: a personal view on some of the problems. Trends Genet 16, 227-231 (2000).

6 Hall, B. K. Homology: the hierarchical basis of comparative biology. (Academic Press, 1994).

7 Notredame, C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3, 131-144, doi:10.1517/14622416.3.1.131 (2002).

8 Notredame, C. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3, e123, doi:10.1371/journal.pcbi.0030123 (2007).

9 Darling, A. E., Miklós, I. & Ragan, M. A. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4, e1000128, doi:10.1371/journal.pgen.1000128 (2008).

10 Beiko, R. G., Harlow, T. J. & Ragan, M. A. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102, 14332-14337, doi:10.1073/pnas.0504068102 (2005).

11 Doolittle, W. F. Phylogenetic classification and the universal tree. Science 284, 2124-2129 (1999).

12 Koonin, E. V. Horizontal gene transfer: essentiality and evolvability in prokaryotes, and roles in evolutionary transitions. F1000Res 5, 1805, doi:10.12688/f1000research.8737.1 (2016).

13 Puigbò, P., Lobkovsky, A. E., Kristensen, D. M., Wolf, Y. I. & Koonin, E. V. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol 12, 66, doi:10.1186/s12915-014-0066-4 (2014).

14 Adl, S. M. et al. The revised classification of eukaryotes. J. Eukaryot. Microbiol. 59, 429-493, doi:10.1111/j.1550-7408.2012.00644.x (2012).

15 Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173-179, doi:10.1038/nature14447 (2015).

16 Bonham-Carter, O., Steele, J. & Bastola, D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 15, 890-905, doi:10.1093/bib/bbt052 (2014).

17 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

18 Cong, Y., Chan, Y. B. & Ragan, M. A. A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF. Sci Rep 6, 30308, doi:10.1038/srep30308 (2016).

19 Domazet-Lošo, M. & Haubold, B. Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics 27, 1466-1472, doi:10.1093/bioinformatics/btr176 (2011).

20 Corel, E., Lopez, P., Méheust, R. & Bapteste, E. Network-thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol 24, 224-237, doi:10.1016/j.tim.2015.12.003 (2016).

21 Dagan, T. Phylogenomic networks. Trends Microbiol 19, 483-491, doi:10.1016/j.tim.2011.07.001 (2011).

22 Huson, D. H. & Bryant, D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254-267, doi:10.1093/molbev/msj030 (2006).

23 Huson, D. H. & Scornavacca, C. A survey of combinatorial methods for phylogenetic networks. Genome Biol Evol 3, 23-35, doi:10.1093/gbe/evq077 (2011).

24 Kunin, V., Goldovsky, L., Darzentas, N. & Ouzounis, C. A. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 15, 954-959, doi:10.1101/gr.3666505 (2005).

25 Bernard, G., Chan, C. X. & Ragan, M. A. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Sci. Rep. 6, 28970, doi:10.1038/srep28970 (2016).

26 Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4, 6504, doi:10.1038/srep06504 (2014).

27 Ragan, M. A., Bernard, G. & Chan, C. X. Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol 11, 176-185, doi:10.4161/rna.27505 (2014).

28 Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol Direct 8, 3, doi:10.1186/1745-6150-8-3 (2013).

29 Reinert, G., Chew, D., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16, 1615-1634, doi:10.1089/cmb.2009.0198 (2009).

30 Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467-1490, doi:10.1089/cmb.2010.0056 (2010).

31 Akman, L. et al. Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nat Genet 32, 402-407, doi:10.1038/ng986 (2002).

32 Seshadri, R. et al. Complete genome sequence of the Q-fever pathogen Coxiella burnetii. Proc Natl Acad Sci U S A 100, 5455-5460, doi:10.1073/pnas.0931379100 (2003).

33 Dagan, T. & Martin, W. The tree of one percent. Genome Biol 7, 118, doi:10.1186/gb-2006-7-10-118 (2006).

34 Greenfield, P. & Roehm, U. Answering biological questions by querying k-mer databases. Concurr. Comput. Pract. Exper. 25, 497-509, doi:10.1002/cpe.2938 (2013).

35 Bernard, G., Chan, C. X. & Ragan, M. A. 143 Prokaryote genomes. doi:10.14264/uql.2016.908 (2016).

36 Bernard, G., Chan, C. X. & Ragan, M. A. Alignment-free network of 143 prokaryote genomes. doi:10.14264/uql.2016.952 (2016).

5.2 K-mer similarity, networks of microbial genomes and taxonomic rank

5.2.1 Abstract

Alignment-free (AF) methods have recently been adopted to infer phylogenetic trees. However, the evolutionary relationships among microbes, impacted by common phenomena such as lateral genetic transfer and rearrangement, cannot be adequately captured in a strictly tree-like structure. Bacterial and archaeal genomes consist of highly conserved regions, e.g. ribosomal RNA genes (commonly used as phylogenetic markers), more-variable regions and extrachromosomal elements, i.e. plasmids (which contain genes critical under selective conditions, e.g. antibiotic resistance genes). The impact of these elements on genome-scale inference of microbial phylogeny remains poorly understood. Here, using an AF approach, we inferred phylogenomic networks of microbial life based on 2785 completely sequenced bacterial and archaeal genomes, and systematically assessed the impact of ribosomal RNA genes and plasmid sequences on this network. Our results indicate that k-mer similarity can correlate with the taxonomic rank of microbes. Using a relational database approach, we linked the implicated k-mers to annotated genomic regions (and thus functions), and defined core functions in specific phyletic groups and genera. We found that, in most phyla, highly conserved functions are often related to Amino acid metabolism and transport, and Energy production and conversion. Our findings indicate that AF phylogenomics can be used to infer reticulate relationships in a scalable manner and provide a new perspective on microbial biology and evolution.

5.2.2 Introduction

Genome evolution in microbes involves highly dynamic molecular mechanisms including genome rearrangement and lateral genetic transfer (LGT). These mechanisms may violate the implicit assumption of full-length contiguity in multiple sequence alignment (MSA), a common step in phylogenetic analysis. Furthermore, MSA-based approaches necessitate heuristic methods, e.g. Bayesian inference, in reconstructing phylogenies, and these do not scale to the quantity of existing and forthcoming genome data1,2. An alternative strategy is to infer evolutionary relatedness based on shared subsequences of fixed length, known as k-mers, i.e. alignment-free (AF) methods3. AF approaches provide deterministic solutions (i.e. pairwise distances between genomes based on shared k-mers) that can be used directly in deriving a phylogenomic network4.

In the past decades, AF approaches have been used in phylogenomics to infer phylogenetic trees of evolving sequences2, complete genomes5-7 and NGS data8. The AF approaches used in phylogenomics can be classified into two categories3, one based on the count of k-mers9,10 and the other based on match lengths11,12. Methods in both categories have been shown to be scalable and accurate in inferring phylogenies at both gene and genome level2,10 while being robust to complex evolutionary events such as LGT or rearrangement7.

However, the evolution of microbial genomes is known not to follow a tree-like structure, notably because of widespread LGT events13,14, and a network structure has increasingly been chosen over a traditional tree to represent phylogenies since the emergence of genome data15. Different types of phylogenetic networks have been developed to increase our understanding of microbial evolution, including genome networks15 and sequence-similarity networks16. These networks, by allowing more than one connection per node (i.e. per sequence or genome), can be used to visualise vertical phylogenetic signal, from an inferred (but unobserved) common ancestor to organisms, and lateral signal between organisms. However, the phylogenetic signal used to build these networks is generally inferred using BLAST hits17 and is therefore based on sequence alignment.

As a proof of concept, we previously generated an AF phylogenetic network for 143 bacterial and archaeal genomes18, using pairwise distances based on the D2S statistic. By varying similarity thresholds in displaying the network, we could easily capture changes in the network structure, e.g. cliques, which reflect evolutionary events and dynamics of microbial genomes. For instance, we recovered the progressive separation of the different genomic lineages throughout their evolution18 and showcased particular relationships between isolates not observed using a classical tree structure7.

Highly conserved regions such as rRNA genes have long been used as phylogenetic markers for the inference of trees, and indeed our current view of the Tree of Life is based on ribosomal proteins19. However, trees based solely on this or other markers represent only a small fraction of the total genomic information20. On the other hand, variable regions or exogenous genetic material, i.e. plasmid genomes, are rarely taken into account when inferring phylogenies. Plasmid genomes are known to be important agents of LGT in microbes21,22 as well as a common vector of antibiotic resistance23, and a better understanding of their contribution to microbial evolution is an urgent matter given the wide spread of antibiotic resistance24. AF approaches allow for the comparison of whole genomes, including exogenous sequences, with good computational performance, but they do not retain information on k-mer locations3,7. Without positional information, the contribution of specific regions, such as rRNA genes or those having arisen from exogenous genetic material, to the phylogenetic signal captured by AF methods remains difficult to assess.

Here, to investigate the impact of plasmids and highly conserved genes on phylogenomic inference, we used 2785 complete bacterial and archaeal genomes to infer AF phylogenomic networks from (a) all genome data including plasmids, (b) chromosomal sequences without ribosomal RNA genes, (c) only ribosomal RNA genes and (d) only plasmid sequences. We systematically assessed the impact of rRNA genes and plasmids on the overall microbial phylogenomic network. Using an advanced database approach, we investigated the core functions that are specific to particular phyletic groups or genera based on the shared k-mers.

5.2.3 Results

To infer a phylogenomic network, we first calculated a pairwise distance d between each pair of genomes based on the D2S distance using 25-mers (see Methods). For each pair of genomes a and b we transformed d_ab into a similarity value S_ab and generated a similarity network, following our earlier approach18; we consider this network to depict phylogenetic relatedness among these genomes, i.e. to be a phylogenomic network. Here we define a threshold t such that only edges with S ≥ t are considered in the network. To compare our results at the genome and phylum levels, we generated I-networks, in which a node represents a distinct genome isolate and an edge between two nodes (isolates) indicates evidence of shared k-mers, and P-networks, in which a node represents a distinct phylum and an edge represents the number of isolates that share k-mers with isolates of another phylum (see Methods). We then compared the k-mer networks based on the topological differences between them at different t. All the I- and P-networks of these 2705 genome isolates are available at http://espace.library.uq.edu.au/view/UQ:54303725.
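The relationship between the two network types can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in this study: similarity values are taken as given, the function names `i_network` and `p_network` are hypothetical, and the weight of a P-network edge is interpreted here as the number of isolate pairs from two different phyla with S ≥ t.

```python
from collections import Counter

def i_network(similarity, t):
    """Edges of the isolate-level network: genome pairs with S >= t."""
    return [pair for pair, s in similarity.items() if s >= t]

def p_network(similarity, phylum_of, t):
    """Collapse the I-network to phyla.

    Edge weight = number of isolate pairs from two different phyla that are
    connected at threshold t (an assumed reading of the edge definition).
    """
    weights = Counter()
    for a, b in i_network(similarity, t):
        pa, pb = phylum_of[a], phylum_of[b]
        if pa != pb:
            weights[tuple(sorted((pa, pb)))] += 1
    return dict(weights)

# Hypothetical isolates and similarity scores, for illustration only.
toy_phylum = {
    "g1": "Firmicutes", "g2": "Firmicutes",
    "g3": "Actinobacteria", "g4": "Actinobacteria",
}
toy_similarity = {
    ("g1", "g2"): 7.2,
    ("g1", "g3"): 3.4,
    ("g2", "g3"): 3.1,
    ("g2", "g4"): 1.0,
    ("g3", "g4"): 6.5,
}
p_edges_t3 = p_network(toy_similarity, toy_phylum, t=3)
p_edges_t0 = p_network(toy_similarity, toy_phylum, t=0)
```

Within-phylum edges are dropped in the collapsed view, so the P-network only summarises how strongly different phyla are linked at a given threshold.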

5.2.3.1 AF networks of microbial evolution

To infer a phylogenomic network of prokaryotes, we used a dataset of 2785 completely sequenced microbial genomes (2619 Bacteria, 176 Archaea) as of 31 January 2016 (Supplementary Table S1). To eliminate redundancy among the data, we kept only one genome where an identical genome (from another isolate) was present (D2S distance = 0). We also removed genomes with little evidence of shared k-mers (D2S distance > 10); these genomes share ≤ 0.01% of 25-mers with any other genome (i.e. there is little evidence of homology). Below the threshold t = 0 (i.e. D2S > 10) the AF network is almost a maximum clique, and the size of the network increases dramatically8 without any addition of meaningful information. The value of D2S depends on the k-mer size, and this threshold limit can vary according to k. Following this filtering step, we took a total of 2705 genomes forward into subsequent analyses. For each network, we systematically assessed the number of connected nodes (c), number of edges (e), maximum clique size (z) and maximum number of cliques (n) across varying levels of the similarity-score threshold t. Here we required a clique to contain three or more edges, and we defined E as the average number of edges per node.
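As a rough illustration of these summary statistics, the Python sketch below computes c, e, E and z for a small edge list. It assumes E = e / c, which is consistent with the values reported in this section (e.g. 10679 / 745 ≈ 14.3 for the plasmid network at t = 0), and it finds the largest clique with a plain Bron-Kerbosch search that is only practical for small graphs; the function name `network_stats` is hypothetical.

```python
from collections import defaultdict

def network_stats(edges):
    """Return (c, e, E, z) for an undirected edge list.

    c: number of connected nodes (nodes with at least one edge)
    e: number of edges
    E: edges per node, taken here as e / c (an assumption)
    z: size of the largest clique
    """
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    c = len(adj)
    e = sum(len(nbrs) for nbrs in adj.values()) // 2
    E = e / c if c else 0.0

    best = 0

    def expand(r_size, p, x):
        # Bron-Kerbosch with pivoting; only the clique size is tracked.
        nonlocal best
        if not p and not x:
            best = max(best, r_size)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r_size + 1, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(0, set(adj), set())
    return c, e, E, best

# Toy graph: a triangle (a, b, c) with one pendant node d.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
stats = network_stats(toy_edges)
```

For networks of thousands of genomes, exhaustive clique enumeration becomes expensive, which may be why n could not be computed at some thresholds.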

The network topology changes substantially with similarity threshold: at t = 0, c = 2705, e = 3835070 and z = 2700, compared to c = 1358, e = 9898 and z = 48 at t = 9 (Supplementary Table S2). As we required a more stringent threshold of shared similarity, the network became less connected, and distinct cliques corresponding to diverse taxa (i.e. phyla, classes, genera) started to form. For example, Bacteria and Archaea form distinct cliques at t = 4, most phyla can be identified as distinct cliques or paracliques at t = 5, and all proteobacterial classes are separate from each other at t > 5.

The I-network is very densely connected at t = 0, with the maximum number of cliques n = 10. The value n is too high to be computed at t = 1 or t = 2, but is 1662785 at t = 3 and decreases to 232 at t = 9 (Supplementary Table S2). Most isolates are members of a single large clique at t = 0 and t = 1, in which E > 1400; at t = 2, E = 736.3. The network becomes less dense at t = 3, with E = 112.8 (Supplementary Table S2). As this network of 2705 nodes remains too densely connected to be visualised and analysed directly, we generated the P-network using the same data, with each node representing a phylum. Figure 1 shows the P-network of the 2705 genomes at t = 3. The thickness of each edge represents the number of instances in which any two genomes (one from each phylum connected by the edge) have a similarity value S ≥ t. Major phyla (e.g. β- and γ-Proteobacteria, Firmicutes, and Tenericutes) are clearly separated at t = 3. The thickest edge (in yellow) is between the β-Proteobacteria and γ-Proteobacteria, suggesting a high similarity among genomes between these groups. In addition, we also observed a large extent of shared 25-mers between Firmicutes and any of the proteobacterial classes.


Figure 1. P-network of prokaryote genomes using D2S with k = 25 based on whole-genome data, at t = 3. An edge between two nodes represents the number of connections between isolates from the two phyla. The size of a node is proportional to the number of isolates within the phylum.

5.2.3.2 Impact of rRNA genes

To determine the contribution of the highly conserved rRNA genes to our AF networks, we computed AF networks using 2616 genomes (a subset of the 2705 above) after excluding all rRNA gene sequences, and genomes with no annotated information (see Methods). The I-network of the genomes from which rRNA genes have been removed has a lower density than the one inferred using the whole dataset. As in the previous I-network, the values are highest at t = 0 (c = 2615, e = 1720082 and z = 1226) and decrease to c = 1290, e = 9008 and z = 47 at t = 9 (Supplementary Table S3). At t = 3, the rRNA gene-free I-network has 38.9 edges per node on average, about 2.9-fold fewer than the 112.8 edges per node in the whole-genome network (Supplementary Table S2). Figure 2 shows the P-network of these 2616 genomes at t = 3. As in Figure 1, the thickest edge (in yellow), between β- and γ-Proteobacteria (Figure 2), indicates the largest number of instances of shared k-mers between genomes from these two groups. This P-network is less dense than the equivalent network based on the whole data (shown in Figure 1).

Although we observed fewer connections between the phyla after removal of rRNA sequences from the genome data, many of the major connections observed in Figure 1 remain, e.g. between β- and γ-Proteobacteria, and between Actinobacteria and γ-Proteobacteria. Thus the sharing of 25-mers contributing to these major connections extends beyond the commonly used phylogenetic marker of rRNA genes.

A network computed using only the rRNA gene sequences (see Methods) was denser than the two corresponding I-networks above. At t = 6, E is high at 854.4 (z = 1321; Supplementary Table S4), compared to 10.4 (z = 82) and 9.6 (z = 74) in the I-networks based on whole-genome and rRNA gene-removed data respectively. Supplementary Figure S1 shows the P-network of 2616 genome isolates based solely on rRNA genes at t = 6. Although most phyla are connected to each other (i.e. 2613 connected nodes and z = 1321 at t = 6), we observed a clear separation between Archaea and Bacteria. These results suggest that rRNA genes can be used to infer a phylogeny that distinguishes Archaea from Bacteria, but that these sequences do not provide sufficient resolution among the various bacterial phyla.

Figure 2. P-network of prokaryote genomes using D2S with k = 25, based on whole-genome data with rRNA genes removed, at t = 3. An edge between two nodes represents the number of connections between isolates from the two phyla. The size of a node is proportional to the number of isolates within the phylum. Singletons are not shown.


5.2.3.3 Evolution of plasmid genomes

To compare the evolutionary histories of extrachromosomal plasmids against those of whole genomes, we computed I- and P-networks using plasmid-only genome data for 921 isolates from 26 phyla (see Methods). Figure 3 shows the I-network of the 921 plasmid genomes at t = 0, in which E = 14.3 (c = 745, e = 10679 and z = 48; Supplementary Table S5). Most phyla appear as distinct cliques, but notably with edges between the Actinobacteria, Firmicutes and the different classes of Proteobacteria. At t = 4 most phyla are separated as distinct cliques, with the exception of ε-Proteobacteria and Firmicutes; the other Proteobacteria (α, β, δ and γ) are in a distinct paraclique. The Euryarchaeota, connected only to the bacterial phylum Planctomycetes at t = 0, are separated from Bacteria at t ≥ 1. All phyla are disjoint at t = 7. These results are not surprising: plasmid genomes are known to evolve faster than core genomes and, in combination with their smaller genome size, this means that fewer shared k-mers are observed at a high similarity threshold26.


Figure 3. I-network of 921 plasmid genomes using D2S with k = 25. An edge between two nodes represents evidence of shared k-mers.

Figure 4 shows E for all four I-networks at different thresholds. For all networks, the number of edges per node (thus network density) decreases as t increases; a higher proportion of shared k-mers is required at a higher (more stringent) threshold. The rRNA gene-only network is denser than the others, i.e. E remains >1000 for t = 0 through t = 6, compared to E < 200 in the others for t = 0 through t = 4. As expected, the highest density of the complete-genome network is observed at t < 2, E > 1400, and E decreases rapidly at t between 2 and 5. The network without rRNA genes has a lower density, E < 800, at t = 0 and decreases to a level similar to that of the previous network at t = 5, i.e. E < 100. These results confirm that rRNAs are more highly conserved (i.e. the sequences are more similar as captured by 25-mers) than are the genome sequences overall. The density profiles of the networks inferred from whole-genome and rRNA gene-removed data are more similar to each other than either is to that of the rRNA gene-only network. Figure 4 also shows that the plasmid network has the lowest density, E < 20 at t ≥ 0, implying that the plasmid genomes have diversified in 25-mer composition more rapidly than have the corresponding main genomes.

[Figure 4 plot: edges per node (y-axis, 0 to 1600) against threshold (x-axis, 0 to 9) for four series: Plasmids, No rRNAs, rRNAs only, Whole genome.]

Figure 4. Number of edges per node, E, across distinct threshold levels of t for each I-network based on (a) complete genomes (core-genome with rRNAs + plasmids), (b) rRNA gene sequences, (c) complete genomes without rRNA genes, and (d) plasmid genomes.

5.2.3.4 Core k-mers of microbial genera

We define a core k-mer in a group of interest as a k-mer that is present in every genome within the group, e.g. a core 25-mer in Proteobacteria is present in all proteobacterial genomes in our database (see Methods). Here, we identified core 25-mers for each genus in our 2785-genome dataset. Of the 699 genera in this dataset, 497 consist of only a single genome isolate, and 51 consist of highly divergent genomes for which no core 25-mers were identified; we excluded these data from this part of the analysis. The 151 genera for which core 25-mers were identified are shown in Supplementary Table S6. To account for the variable numbers of representative isolates of these genera in our dataset, we define K as the number of distinct core k-mers per isolate for each genus; this value can indicate the extent of genome divergence (and thus the evolutionary rate of these genomes) for each of these genera. The three genomes of Azotobacter have the highest number of core k-mers, with K = 1722079; these genomes represent distinct isolates of the same species, Azotobacter vinelandii. This is in contrast to the 123 Streptococcus genomes (of 27 species) that share only one core k-mer (K = 0.008). Among the 20 genera with the greatest K values, Shigella has the highest number of distinct isolates (10 isolates from four species) at K = 33698. This is in stark contrast to K = 4.82 among the 11 Ralstonia genome isolates from three species. Thus these Shigella genomes have diverged less from their common ancestor than have these Ralstonia genomes from theirs, as assessed by shared 25-mers.
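The definitions of a core k-mer and of K can be made concrete with a short Python sketch. It is a simplified illustration, not the database implementation described in Methods: k-mers are taken from the forward strand only, the sequences are invented, and K is computed as the number of distinct core k-mers divided by the number of isolates, which matches the Streptococcus example (1 / 123 ≈ 0.008).

```python
def kmers(seq, k):
    """Set of distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def core_kmers(genomes, k):
    """k-mers present in every genome of the group.

    genomes maps an isolate name to a list of sequences (e.g. chromosome
    and plasmids); an empty group has no core k-mers.
    """
    sets = []
    for seqs in genomes.values():
        s = set()
        for seq in seqs:
            s |= kmers(seq, k)
        sets.append(s)
    return set.intersection(*sets) if sets else set()

def core_per_isolate(genomes, k):
    """K: number of distinct core k-mers divided by the number of isolates."""
    return len(core_kmers(genomes, k)) / len(genomes)

# Hypothetical genus with three short "genomes"; k = 4 keeps the toy readable.
toy_genus = {
    "iso1": ["ACGTACGTTT"],
    "iso2": ["TTACGTAC"],
    "iso3": ["GGACGTA"],
}
toy_core = core_kmers(toy_genus, 4)
toy_K = core_per_isolate(toy_genus, 4)
```

Because the core set is an intersection, adding a divergent isolate can only shrink it, which is why genera sampled across many species tend to have low K.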

5.2.3.5 Core functions of microbial phyla To relate the shared k-mers to biological functions, we assembled all 25-mers in the 2785 genomes and their associated genome locations and annotated function based on Clusters of Orthologous Groups (COGs27) in a relational database. Then using the core 25-mers above, we identified the core functions in each of the 151 genera based on annotated functions that are associated with these k-mers (e.g. using k-mer position information to identify the corresponding gene, or non- coding sequence, and the gene annotation when available). For this analysis, we focused on protein-coding sequences (i.e. rRNA sequences were discarded: Methods), resulting a set of core 25-mers from 112 genera in 15 phyla; the corresponding COG functional categories for these core 25-mers are shown in Supplementary Table S7. The non-informative functional categories R (General function prediction only) and S (Function unknown) were excluded in subsequent analysis. We do not identify any core k-mers related to the functional category Y (Nuclear structure) in our dataset. The less-represented functional categories in our data (those with proportion <1%) are A (RNA processing and modification), B (Chromatin structure and dynamics), W (Extracellular structure) and Z (Cytoskeleton). The , Euryarchaeota and Thaumarchaeota are the only phyla with evidence of core k-mers associated with the functional category B. The functional category A is related to core k-mers only in the proteobacterial classes (with the exception of the ε-Proteobacteria) and in phylum Actinobacteria. Figure 5 shows the proportions of COG number associated with core 25-mers across the 23 COG categories for 16 phyla, showing the top five categories for each phylum. Categories E (Amino acid metabolism and transport) and C (Energy production and conversion) are among the top five categories in 15 and 13 phyla respectively. 
The ε-Proteobacteria, Thaumarchaeota, Euryarchaeota, Actinobacteria, Cyanobacteria and Chloroflexi are also the only phyla with category H (Coenzyme metabolism) in the top five. For the phyla Tenericutes, Deinococcus-Thermus, Firmicutes and Crenarchaeota, the most represented functional categories include P (Inorganic ion transport and metabolism), L (Replication and repair), J (Translation), E and G (Carbohydrate metabolism and transport). The phylum Bacteroidetes is the only phylum for which categories O (Post-translational modification, protein turnover, chaperone functions), Q (Secondary structure) and F (Nucleotide metabolism and transport) are among the top five. Phylum is the only one with U (Intracellular trafficking and secretion) and T (Signal transduction) in the top five, but the COG numbers associated with core 25-mers are extremely low.

To determine whether the phyla can be clustered based on their COG category profiles, we performed a series of principal component analyses (PCA). PCA on the raw data (i.e. non-normalised counts of COG numbers) did not show evidence of any particular clustering (Supplementary Figure S2). Nor do the genera cluster according to the number of isolates (Supplementary Figure S3). These results indicate that the different numbers of isolates per genus do not bias our analysis of functional categories. Supplementary Figure S4 shows the PCA performed on the normalised counts of COG numbers with center-scaled COG categories (i.e. COG categories with equal weights). In this analysis Nitrosopumilus, the only genus in phylum Thaumarchaeota in this dataset, is isolated from the other genera. Genus Dehalococcoides, a member of phylum Chloroflexi, is likewise separated from the other genera by this measure.
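The preprocessing described above (row normalisation followed by centre-scaling of each COG category) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function names and the toy count matrix are hypothetical, and the PCA step itself (e.g. an eigen-decomposition of the resulting matrix) is omitted.

```python
from statistics import mean, stdev

def normalise_rows(counts):
    """Convert each genus's COG counts to proportions of that genus's total."""
    return [[c / sum(row) for c in row] for row in counts]

def centre_scale_columns(matrix):
    """Give every COG category (column) zero mean and unit sample variance."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sd = mean(col), stdev(col)
        # A constant column carries no information; map it to zeros.
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical toy counts: 4 genera (rows) x 3 COG categories (columns).
toy_counts = [[10, 5, 1], [20, 11, 2], [3, 9, 8], [4, 10, 9]]
profiles = centre_scale_columns(normalise_rows(toy_counts))
```

After this step every category contributes equally to the analysis, regardless of how many COGs it contains in absolute terms.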

Figure 5. P-network of prokaryote genomes using $D_2^S$ with k = 25, based on whole-genome data with rRNA genes removed, at t = 3. The nodes are pie-charts representing the COG-category profiles of each phylum. Each COG category is color-coded. Only the top five categories are displayed; in most cases the top five categories account for at least 50%.

5.2.3.6 Computational efficiency and scalability

To compute the $D_2^S$ distance between microbial genomes we used a modified version of our own implementation of the $D_2^S$ statistics7. This newer version computes the $D_2^S$ distance for two genomes at a time. Each pairwise distance can be computed independently, so we ran thousands of parallel jobs to carry out the pairwise comparisons for our different AF networks across a high-performance distributed-memory computing cluster. On average, it takes about 15 seconds (wall time) to compute the $D_2^S$ distance between two microbial genomes (time can vary depending on the genome size) and 16 hours (wall time) in total using 1000 CPUs on a high-performance cluster. The principal advantage of this approach, as compared to our previous implementation of $D_2^S$, is that it is not limited by memory consumption, as each job requires only 2-3 GB. Although visualisation of the network using the D3 library is scalable to large data, it can take a few minutes for the force-directed algorithm to provide an optimal layout for a densely connected network.
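The reported total wall time is consistent with a simple back-of-envelope estimate. The sketch below assumes that every unordered pair of the 2785 genomes is compared exactly once, that each comparison takes the reported average of 15 seconds, and that the work is perfectly balanced across 1000 CPUs; the variable names are illustrative only.

```python
from math import comb

N_GENOMES = 2785        # genomes in the dataset (from the text)
SECONDS_PER_PAIR = 15   # average wall time per comparison (from the text)
N_CPUS = 1000           # CPUs used (from the text)

n_pairs = comb(N_GENOMES, 2)                    # unordered genome pairs
cpu_hours = n_pairs * SECONDS_PER_PAIR / 3600   # total compute time
wall_hours = cpu_hours / N_CPUS                 # assumes perfect load balancing

print(n_pairs, round(cpu_hours), round(wall_hours, 1))
# prints: 3876720 16153 16.2
```

Under these assumptions roughly 3.9 million comparisons are needed, which at 15 seconds each amounts to about 16 hours on 1000 CPUs.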

Extraction of the core k-mers took less than an hour for our dataset of 2785 microbial genomes. Mapping the core k-mers of 1475 genomes to our SQL database (stored on a solid-state drive) took less than one hour.

5.2.4 Discussion

In this study we demonstrate that AF approaches can be used to infer phylogenetic networks quickly and accurately for large-scale microbial whole-genome data. We introduce for the first time the concept of k-mer similarity network and two different types of AF networks, the I- and P-networks. We show that by combining a k-mer approach with the use of a relational database, biological information can be accessed for large-scale data at unprecedented speed. Finally, we define core k-mers as k-mers present in every isolate genome of a genus, following the concept of core genes28,29.

We examined the impact of rRNA genes and plasmids on the phylogenetic signal captured when computing phylogenomic relationships among microbial genomes. As expected, the rRNA genes contribute to the phylogenetic signal captured by 25-mers, as they do in MSA-based approaches. However, the pattern of network density versus threshold value (see Figure 4) indicates that the phylogenetic signal recovered here is not driven by rRNAs alone. Our result that, in general, these rRNA genes do not resolve relationships among (and sometimes within) is in line with many previous studies20,30-32. The density of the AF plasmid network confirms the large diversity of these mobile genetic elements, and we found that the connections observed in this network are similar to those in the networks based on whole genomes, with or without rRNA genes, and on rRNA gene sequences. The proteobacterial classes tend to have the strongest connections in all our AF networks, in particular between β- and γ-Proteobacteria, and we also observed strong similarity between the Actinobacteria and Proteobacteria or Firmicutes across all networks. The large extent of LGT between β- and γ-Proteobacteria33 isolates partly explains this strong similarity in our AF networks.

Overall, we demonstrated that the I- and P-networks provide a quick overview of the evolutionary relationships among whole genomes, or subsets of genomes, in large-scale datasets. Moreover, our AF networks, based on pairwise comparison of 25-mers between isolates, can be used to study the evolutionary dynamics aggregated at different taxonomic levels: by varying the threshold t we can visualise evolutionary patterns among kingdoms (e.g. Archaea and Bacteria at t < 3), phyla (e.g. Proteobacteria, Firmicutes etc. at 3 ≤ t ≤ 5), classes (e.g. of Proteobacteria at 4 ≤ t ≤ 6), and between and/or within genera (e.g. Escherichia coli and Shigella at t > 6).

Our approach to finding the most highly conserved functions (apart from those of rRNAs) using core 25-mer profiles has shown that the biological functions associated with the metabolism and transport of amino acids, and the production and conversion of energy, are the categories most conserved in our dataset. The core 25-mer profiles revealed that similar core biological function profiles are observed for phyla that share a large proportion of k-mers in our AF network. Our analysis also indicates that the functions highly conserved in ε-Proteobacteria and in δ-Proteobacteria are distinct from those conserved in the other proteobacterial classes. Except for the two most highly conserved categories (above), the ε-Proteobacteria do not share highly conserved functions with the other classes of Proteobacteria; indeed, the ε-Proteobacteria share more 25-mers with the Firmicutes and with the Actinobacteria than with other Proteobacteria. These results support previous findings showing that the ε-Proteobacteria are the most-basal Proteobacteria by most criteria, and the last class in this phylum to have been recognised34. Finally, we also observed that the Tenericutes are among the only phyla that do not have highly conserved functions related to energy production and conversion; this can be related to their parasitic lifestyle35.

Of these 699 genera, no core 25-mers were recovered for 51, particularly in genera represented by genome sequences for many isolates from different species. For these, core k-mer sets might be sought at lower values of k, although at the potential risk of capturing spurious signal arising from false positives and background noise. Similarly, some phyla that we used to identify highly conserved functions have few distinct COGs related to core 25-mers.

A major advantage of AF approaches in general (and this approach in particular) lies in their computational performance in the inference of phylogenetic networks, and in the extraction and mapping of core k-mers to biological functions7,36. Because our approach consists of independent pairwise comparisons we can distribute the computation across multiple processors, greatly reducing problems potentially arising due to demand on memory7. Here we inferred 25-mer similarity networks among > 2700 genomes in a matter of hours (see Methods). To map core k-mers to our database we took advantage of the SQL architecture, indexing and hashing to compare billions of k-mers in a few minutes using a solid-state drive. The database itself could be generated in only a few hours from RefSeq data of more than 4000 microbial isolates.

It would be of great interest to be able to discriminate the edges in the AF networks based on the dominant phylogenetic signal observed (e.g. vertical versus lateral). To visualise large phylogenetic networks such as those presented here, the D3 library (and web technology more generally) might not be the optimal approach. Indeed, even with recent improvements in JavaScript-based applications and an optimised library such as D3, it is difficult for web browsers to display large networks in a force-directed layout. We could instead use software specifically designed for visualisation of large networks, e.g. Gephi37, although undoubtedly at the expense of accessibility, e.g. through unfamiliarity among users or loss of cross-browser compatibility. Finally, we understand that an open-access, publicly available version of a k-mer database would be useful for our research community; however, such a database would require dedicated servers, management and support to remain durable and useful to the community in the long term.

5.2.5 Methods

Data. The 2785 completely sequenced genomes of Bacteria and Archaea were downloaded from NCBI on 31 January 2016 (Supplementary Table S1); functional annotation of these genomes was obtained through the corresponding RefSeq records. Genes encoding ribosomal RNAs were identified based on annotation. Genomes with no annotation information were excluded from our rRNA gene network. Among the 2785 isolates, 921 contain plasmids; these plasmid genomes were used in the plasmid-only network.

Relational database of k-mers and genome features. We extracted 10,059,526,408 distinct 25-mers from the genomes of 4401 bacterial and archaeal isolates (as of 31 January 2016 in NCBI RefSeq), of which 2781 genomes are complete. We tabulated these k-mers, and their genomic locations and features (based on RefSeq annotations), in a relational database using SQL, following Greenfield and Roehm36. Tables in this database contain the list of isolates, the list of genes and their sequences, taxonomic information for each isolate, an indexed list of all 25-mers, an indexed list of gene-by-gene comparisons for each pair of genes, and an indexed list of genome-by-genome comparisons for each pair of genomes. The relational model is shown in Supplementary Figure S5.
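To illustrate how an indexed k-mer table can link a k-mer to an annotated function, the following sketch builds a much-reduced schema in an in-memory SQLite database. All table names, column names and rows are hypothetical and chosen for illustration; the actual relational model is the one shown in Supplementary Figure S5, and the overlap test here (k-mer start position falling inside a gene) is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE isolate (isolate_id INTEGER PRIMARY KEY, name TEXT, phylum TEXT, genus TEXT);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY,
                   isolate_id INTEGER REFERENCES isolate(isolate_id),
                   start INTEGER, stop INTEGER, cog_category TEXT);
CREATE TABLE kmer (sequence TEXT,
                   isolate_id INTEGER REFERENCES isolate(isolate_id),
                   position INTEGER);
-- The index lets lookups by k-mer sequence avoid a full table scan.
CREATE INDEX kmer_sequence_idx ON kmer (sequence);
""")

example_kmer = "ACGT" * 6 + "A"  # an arbitrary 25-mer
conn.execute("INSERT INTO isolate VALUES (1, 'isolate A', 'Proteobacteria', 'Escherichia')")
conn.execute("INSERT INTO gene VALUES (1, 1, 100, 400, 'E')")
conn.execute("INSERT INTO kmer VALUES (?, 1, 150)", (example_kmer,))

def annotate(kmer_sequence):
    """Return (isolate name, COG category) for genes containing the k-mer start."""
    query = """
        SELECT i.name, g.cog_category
        FROM kmer k
        JOIN isolate i ON i.isolate_id = k.isolate_id
        JOIN gene g ON g.isolate_id = k.isolate_id
                   AND k.position BETWEEN g.start AND g.stop
        WHERE k.sequence = ?
    """
    return conn.execute(query, (kmer_sequence,)).fetchall()

hits = annotate(example_kmer)
```

With the index in place, each lookup by k-mer sequence is resolved through the index rather than by scanning all rows, which is the property that makes mapping large k-mer sets practical.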

AF network. We followed Bernard et al.18 in generating the AF networks. First we computed pairwise comparisons for the 2785 isolates and generated for each comparison the corresponding $D_2^S$ distance7, d, using 25-mers across parallel CPUs. For each pair of genomes a and b, we transformed d into a similarity measure $S_{ab}$, where $S_{ab} = 10 - d$. We computed the fraction of shared 25-mers for a small sample of isolates from the 2785 genomes used in this study and compared these fractions with the corresponding $D_2^S$ values. The relationship between $D_2^S$ and the fraction of shared 25-mers is shown in Supplementary Table S8. We discarded all instances for which d > 10, as these pairs of sequences share ≤ 0.01% of 25-mers (i.e. there is almost no evidence of homology). We then generated the networks using JSON files containing the S values as input for a JavaScript script using the D3 library (https://d3js.org/). Here, we present two types of AF networks. For a phylum-level depiction of the network (P-network) we grouped all sequences of the same phylum as a single entity prior to calculating the distance; each phylum is represented by a node in the network. The width of the edge between two nodes represents the number of connections between isolates from these two phyla, and the size of each node is proportional to the number of isolates in the phylum. For an isolate-level depiction of the network (I-network) we treated each genome isolate as a single entity (i.e. node). In this network, an edge between two nodes indicates evidence of shared k-mers. The AF networks include a similarity-score threshold t, for which only edges with S > t are displayed; changing t therefore can dynamically change the structure of the network18. The resulting dynamic networks can be visualized using any web browser. All the networks can be found here: http://espace.library.uq.edu.au/view/UQ:54303725.
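The transformation and thresholding steps above can be sketched as follows. This is an illustrative reconstruction rather than the scripts used in this study: the function names, the toy distances and the JSON field names (a nodes/links layout of the kind often supplied to D3 force layouts) are assumptions, and the phylum-level edge weight is computed here simply by counting retained isolate-to-isolate connections between different phyla.

```python
import json
from collections import Counter

def similarity_edges(distances, max_distance=10.0):
    """Convert pairwise distances d into similarities S = 10 - d, dropping d > 10."""
    return {pair: 10.0 - d for pair, d in distances.items() if d <= max_distance}

def i_network(edges, t):
    """Isolate-level network: keep only edges with S > t."""
    return {pair: s for pair, s in edges.items() if s > t}

def p_network(edges, phylum_of, t):
    """Phylum-level network: count retained connections between different phyla."""
    counts = Counter()
    for (a, b), s in edges.items():
        if s > t and phylum_of[a] != phylum_of[b]:
            counts[tuple(sorted((phylum_of[a], phylum_of[b])))] += 1
    return counts

# Hypothetical toy distances between four isolates.
distances = {("g1", "g2"): 2.5, ("g1", "g3"): 6.0, ("g2", "g3"): 6.4, ("g1", "g4"): 11.2}
phylum_of = {"g1": "Proteobacteria", "g2": "Proteobacteria",
             "g3": "Firmicutes", "g4": "Crenarchaeota"}

edges = similarity_edges(distances)
graph = {
    "nodes": [{"id": g, "phylum": p} for g, p in phylum_of.items()],
    "links": [{"source": a, "target": b, "value": s}
              for (a, b), s in i_network(edges, t=3).items()],
}
print(json.dumps(graph, indent=2))
```

Raising t in this toy example from 3 to 5 removes the two Proteobacteria-Firmicutes links and leaves only the link between the two closely related isolates, mirroring how the threshold moves the view from broad to narrow taxonomic levels.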

Core k-mers and COG categories. For a specific group of microbial isolates (e.g. a genus, or a phylum) we extracted the set of 25-mers that are found in all isolates within the group; we define this set of 25-mers as the core k-mers for the corresponding group. Using the relational database of k-mers (above), for these core 25-mers we identified their corresponding genome locations and function based on COG (Clusters of Orthologous Groups)38 annotation in RefSeq records. We generated profiles of COG functional categories for each of the 151 genera, for each of the 11 phyla, and for the five proteobacterial classes in which core k-mers are identified using our approach.
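The definition of core k-mers as the k-mers found in every isolate of a group corresponds to a set intersection, which can be sketched as below. The function names and toy sequences are hypothetical; the sketch considers one strand only, and it profiles categories by counting annotated k-mers, whereas the profiles in this study are based on COG numbers retrieved from the database.

```python
from collections import Counter

def kmers(sequence, k=25):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def core_kmers(genomes, k=25):
    """Intersect the k-mer sets of all genomes in a group."""
    kmer_sets = [kmers(seq, k) for seq in genomes]
    return set.intersection(*kmer_sets) if kmer_sets else set()

def cog_profile(core, cog_of_kmer):
    """Proportion of annotated core k-mers per COG category, ignoring R and S."""
    counts = Counter(cog_of_kmer[km] for km in core
                     if km in cog_of_kmer and cog_of_kmer[km] not in {"R", "S"})
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

# Toy example with k = 4 for readability (the analysis above uses k = 25).
group = ["ACGTACGGA", "TTACGTACC", "GACGTACAA"]
core = core_kmers(group, k=4)
profile = cog_profile(core, {"ACGT": "E", "CGTA": "E", "GTAC": "S"})
```

Here the three toy sequences share the 4-mers ACGT, CGTA and GTAC; the k-mer TACG is absent from the third sequence and is therefore not part of the core.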

Computational scalability and runtime. Assessment of computational scalability was carried out using a high-performance distributed-memory computing cluster based on Intel Xeon Haswell (3.1 GHz) cores.

5.2.6 References

1 Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol Direct 8, 3, doi:10.1186/1745-6150-8-3 (2013).

2 Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4, 6504, doi:10.1038/srep06504 (2014).

3 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

4 Ali, W., Rito, T., Reinert, G., Sun, F. & Deane, C. M. Alignment-free protein interaction network comparison. Bioinformatics 30, i430-i437, doi:10.1093/bioinformatics/btu447 (2014).

5 Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106, 2677-2682, doi:10.1073/pnas.0813249106 (2009).

6 Sims, G. E. & Kim, S. H. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci U S A 108, 8329-8334, doi:10.1073/pnas.1105168108 (2011).

7 Bernard, G., Chan, C. X. & Ragan, M. A. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer. Sci Rep 6, 28970, doi:10.1038/srep28970 (2016).

8 Song, K. et al. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 15, 343-353, doi:10.1093/bib/bbt067 (2014).

9 Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 41, e75, doi:10.1093/nar/gkt003 (2013).

10 Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, doi:10.1186/1471-2148-9-195 (2009).

11 Leimeister, C. A. & Morgenstern, B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30, 2000-2008, doi:10.1093/bioinformatics/btu331 (2014).

12 Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336-350, doi:10.1089/cmb.2006.13.336 (2006).

13 Bapteste, E. et al. Prokaryotic evolution and the tree of life are two different things. Biol Direct 4, 34, doi:10.1186/1745-6150-4-34 (2009).

14 Cong, Y., Chan, Y. B., Phillips, C. A., Langston, M. A. & Ragan, M. A. Robust Inference of Genetic Exchange Communities from Microbial Genomes Using TF-IDF. Front Microbiol 8, 21, doi:10.3389/fmicb.2017.00021 (2017).

15 Corel, E., Lopez, P., Meheust, R. & Bapteste, E. Network-Thinking: Graphs to Analyze Microbial Complexity and Evolution. Trends Microbiol 24, 224-237, doi:10.1016/j.tim.2015.12.003 (2016).

16 Cheng, S. et al. Sequence similarity network reveals the imprints of major diversification events in the evolution of microbial life. Frontiers in Ecology and Evolution 2, doi:10.3389/fevo.2014.00072 (2014).

17 Tatusova, T. A. & Madden, T. L. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS microbiology letters 174, 247-250 (1999).

18 Bernard, G., Ragan, M. A. & Chan, C. X. Recapitulating phylogenies using k-mers: from trees to networks [version 2; referees: 2 approved]. F1000Research 5, 2789, doi:10.12688/f1000research.10225.2 (2016).

19 Hug, L. A. et al. A new view of the tree of life. Nat Microbiol 1, 16048, doi:10.1038/nmicrobiol.2016.48 (2016).

20 Dagan, T. & Martin, W. The tree of one percent. Genome Biol 7, 118, doi:10.1186/gb-2006-7-10-118 (2006).

21 Skippington, E. & Ragan, M. A. Within-species lateral genetic transfer and the evolution of transcriptional regulation in Escherichia coli and Shigella. BMC Genomics 12, 532, doi:10.1186/1471-2164-12-532 (2011).

22 Schluter, A. et al. Erythromycin resistance-conferring plasmid pRSB105, isolated from a sewage treatment plant, harbors a new macrolide resistance determinant, an integron-containing Tn402-like element, and a large region of unknown function. Appl Environ Microbiol 73, 1952-1960, doi:10.1128/AEM.02159-06 (2007).

23 Barlow, M. What antimicrobial resistance has taught us about horizontal gene transfer. Methods Mol Biol 532, 397-411, doi:10.1007/978-1-60327-853-9_23 (2009).

24 Ventola, C. L. The antibiotic resistance crisis: part 1: causes and threats. P T 40, 277-283 (2015).

25 Bernard, G. AF networks of 2785 microbial genomes. doi:10.14264/uql.2017.436 (2017).

26 Fondi, M. & Fani, R. The horizontal flow of the plasmid resistome: clues from inter-generic similarity networks. Environ Microbiol 12, 3228-3242, doi:10.1111/j.1462-2920.2010.02295.x (2010).

27 Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic acids research 28, 33-36 (2000).

28 Lerat, E., Daubin, V. & Moran, N. A. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 1, E19, doi:10.1371/journal.pbio.0000019 (2003).

29 Daubin, V., Gouy, M. & Perriere, G. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 12, 1080-1090, doi:10.1101/gr.187002 (2002).

30 Woese, C. R. Bacterial evolution. Microbiol Rev 51, 221-271 (1987).

31 Pace, N. R. Mapping the tree of life: progress and prospects. Microbiol Mol Biol Rev 73, 565-576, doi:10.1128/MMBR.00033-09 (2009).

32 Forterre, P. The universal tree of life: an update. Front Microbiol 6, 717, doi:10.3389/fmicb.2015.00717 (2015).

33 Beiko, R. G., Harlow, T. J. & Ragan, M. A. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102, 14332-14337, doi:10.1073/pnas.0504068102 (2005).

34 Trust, T. J. et al. Phylogenetic and molecular characterization of a 23S rRNA gene positions the genus Campylobacter in the epsilon subdivision of the Proteobacteria and shows that the presence of transcribed spacers is common in Campylobacter spp. J Bacteriol 176, 4597-4609 (1994).

35 Skennerton, C. T. et al. Phylogenomic analysis of Candidatus 'Izimaplasma' species: free-living representatives from a Tenericutes clade found in methane seeps. ISME J 10, 2679-2692, doi:10.1038/ismej.2016.55 (2016).

36 Greenfield, P. & Roehm, U. Answering biological questions by querying k-mer databases. Concurrency and Computation: Practice and Experience 25, 497-509, doi:10.1002/cpe.2938 (2013).

37 Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. ICWSM 8, 361-362 (2009).

38 Powell, S. et al. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Research 40, D284-D289, doi:10.1093/nar/gkr1060 (2012).

5.3 Concluding remarks

This chapter demonstrates that AF approaches can be used to infer phylogenomic networks of thousands of microbial genomes. Here I introduced for the first time the concept of a k-mer similarity network and two different types of AF networks, the I- and P-networks. I examined the impact of rRNA genes and plasmids on the phylogenetic signal captured when computing phylogenomic relationships among microbial genomes. I found that the AF networks, based on pairwise comparison of 25-mers between isolates, can be used to study the evolutionary dynamics aggregated at different taxonomic levels. I also defined core k-mers as the k-mers present in every isolate genome of a genus, and found that the biological functions related to Amino acid metabolism and transport, and Energy production and conversion, are the most conserved categories across the phyla tested. Overall, I showed that by combining a k-mer approach with the use of a relational database, biological information can be accessed for large-scale data at unprecedented speed.

CHAPTER 6: GENERAL DISCUSSION

The development of next-generation sequencing technologies provides an opportunity to study the evolution of organisms using whole-genome information, to gain a different perspective on microbial evolution and to better understand the complex phenomena involved. Over the past two decades, AF phylogenetic approaches have been developed to infer phylogenies of large data in a scalable manner. AF methods decompose the sequences into fragments of a fixed length, e.g. k-mers, and compare those fragments. Many AF methods compute a matrix of pairwise distances between sequences by comparing their k-mer distributions1, while others look for the longest common k-mers between two sequences2. The AF phylogenetic tree is inferred by using the resulting distance matrix as input into a distance method such as neighbour-joining3. The AF methods currently available have shown good computational performance in terms of speed, memory consumption and accuracy4-6. These methods offer a promising alternative to the current classical phylogenetic approaches based on MSA, but until the work reported herein their robustness to specific evolutionary events and their scalability had not been systematically investigated.
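The general word-count workflow described above (decompose into k-mers, compare count profiles, assemble a distance matrix) can be sketched as follows. This is a minimal illustration and not one of the methods evaluated in this thesis: the score is the plain inner product of k-mer counts, and the conversion to a distance uses a cosine-style normalisation chosen here purely for simplicity. The function names and toy sequences are hypothetical, and each sequence is assumed to be at least k residues long.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def d2(x, y):
    """Inner product of two k-mer count vectors."""
    return sum(n * y[w] for w, n in x.items() if w in y)

def distance(x, y):
    """Illustrative distance: 1 minus the score normalised by the self-scores."""
    return 1.0 - d2(x, y) / sqrt(d2(x, x) * d2(y, y))

def distance_matrix(sequences, k):
    """Symmetric matrix of pairwise distances, keyed by (name, name)."""
    profiles = {name: kmer_counts(seq, k) for name, seq in sequences.items()}
    matrix = {(a, a): 0.0 for a in profiles}
    for a, b in combinations(profiles, 2):
        matrix[a, b] = matrix[b, a] = distance(profiles[a], profiles[b])
    return matrix

# Hypothetical toy sequences; real analyses use far longer sequences and larger k.
sequences = {"s1": "ACGTACGTAC", "s2": "ACGTACGTTC", "s3": "GGGGCCCCAA"}
matrix = distance_matrix(sequences, k=3)
```

The resulting matrix could then be passed to a distance-based tree-building method such as neighbour-joining; that step is omitted here.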

In this thesis, I focused on phylogenomic inference of microbial genomes using AF approaches. Microbial evolution has always been a difficult subject to study, due to the complex evolutionary events involved. LGT, rampant among microbial genomes7, and genomic rearrangement8 violate the assumption of full-length contiguity that underlies MSA-based approaches. AF approaches do not compare sequences using the positional information of k-mers, and could therefore be robust against these evolutionary events. To better address these problems, I studied and systematically assessed the performance of these AF approaches in inferring phylogenomic relationships of evolving sequences and complete microbial genomes. I also developed an extended AF approach to infer phylogenetic networks and quickly extract biological information based on k-mer position.

In this chapter I summarise the key findings of the thesis and discuss possible future directions and extensions of this work.

6.1 Phylogenetic inference of evolving sequences without multiple sequence alignment

In Chapter 3, I systematically assessed the performance of a set of AF methods, the $D_2$ statistics, in phylogenetic inference of simulated and empirical DNA and protein sequences. I chose the $D_2$ statistics, which are based on a statistical comparison of word counts, because these methods have shown promising performance in previous studies6,9. I also developed a new version of the $D_2$ statistics, $D_2^n$, based on comparison of k-mer neighbourhoods, in order to minimise the impact of background noise. I simulated evolving sequences under various biological scenarios and compared the phylogenies inferred by the AF and MSA-based approaches. My findings demonstrated that AF methods, compared to the current standard based on MSA, are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to sequence divergence and the presence of incomplete (truncated) sequence data.
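For reference, the two word-count statistics most relevant to this discussion are commonly written in the AF literature as shown below. The notation is an assumption introduced here for illustration and may differ from that of Chapter 3: $X_w$ and $Y_w$ are the counts of word $w$ of length $k$ in two sequences of lengths $n$ and $m$, $\mathcal{A}$ is the alphabet, and $p_w$ is the probability of $w$ under a background model.

```latex
% Plain word-count statistic: inner product of the two k-mer count vectors.
D_2 = \sum_{w \in \mathcal{A}^k} X_w \, Y_w

% Counts centred by their expectation under the background model.
\tilde{X}_w = X_w - (n - k + 1)\, p_w , \qquad
\tilde{Y}_w = Y_w - (m - k + 1)\, p_w

% Self-standardised variant, which down-weights words dominated by background noise.
D_2^S = \sum_{w \in \mathcal{A}^k}
        \frac{\tilde{X}_w \, \tilde{Y}_w}{\sqrt{\tilde{X}_w^{2} + \tilde{Y}_w^{2}}}
```

The centring step is what distinguishes $D_2^S$ from plain $D_2$: words that occur about as often as expected by chance contribute little to the score.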

The AF approach implemented in this study appeared to have no difficulty, at appropriate parameter settings across my simulated datasets, in capturing homology signal and generating topologies that are very similar or identical to those generated by MSA followed by Bayesian inference, arguably the current standard in phylogenetics. The robustness of AF methods to rearrangements and insertions/deletions represents a critical advantage, since these events are common among microbial genomes8 and frequently interrupt individual genes10. This issue is well understood and accepted by the authors of MSA methods for bacterial genomes8.

Using extensive simulated data and diverse empirical data, my results consistently demonstrated the relative accuracy and scalability of AF methods in large-scale phylogenetic inference, regardless of which specific method they were compared against. The empirical datasets used in this study were highly diverse, with various extents of within-set sequence divergence and data sizes. Many of these sequence sets contained partial and/or fragmented sequences. As in the analysis of simulated sequence sets, these aspects impact the accuracy of AF methods more than that of MSA-based approaches in recovering accurate phylogenies.

In general, my results demonstrated the utility and robustness of the different AF methods used. The non-monotonic relationship between word length and performance, the utility of the different $D_2$ statistics and the failure of larger mismatch neighbourhoods are broadly consistent with previous reports6,11. However, simple $D_2$ scoring is known to be dominated by single-sequence noise effects as k increases6; its good performance here may in part be explained by the normalisation inherent in my distance measure.

I also found that the AF methods were orders of magnitude faster than the comparable MSA-based approaches, with memory requirements in the hundreds of megabytes, well within the capabilities of any standard computer. To the extent that memory is not an issue, AF methods are a potential candidate as a highly scalable alternative to MSA-based methods in large-scale phylogenomic analyses.

6.2 Microbial phylogenomics using alignment-free approaches

In Chapter 4, I studied the performance of nine AF methods in inferring phylogenomic relationships among microbial genomes. These methods represent the two AF families, those based on word counts and those based on match lengths. At the genome scale, more-complex evolutionary phenomena interfere with the vertical inheritance signal. I simulated bacterial-like genome datasets to systematically assess the robustness of the AF approaches to sequence divergence, LGT and rearrangement. I found that AF phylogenetic approaches can be used to quickly and accurately infer phylogenomic relationships among microbes using whole-genome data. I also introduced a method, based on the jackknife, to provide node-support values in phylogenetic trees constructed using AF approaches.

In general, the methods based on word counts outperformed the match-length methods. All these methods performed well on highly similar sequence data, but the methods based on word counts proved to be more robust as the input data became more divergent. All these AF methods proved robust against moderate amounts of LGT, antiquity of LGT and genome-scale rearrangement, extending my findings in Chapter 3 based on analysis of gene-scale sequence data12. My results also show that the optimal parameters of AF methods are sensitive to sequence length and genome divergence.

The jackknife approach was simple to apply without prior sequence alignment. Here I deleted 40% of the data because 40% was previously shown to provide a reasonable balance between the generation of useful replicates and loss of phylogenetic signal in sequence data13,14. The resulting JK support values appear to be biologically meaningful, as recognised taxa were often strongly supported. Particularly in the 143-genome dataset, JK support tended to decrease with increasing sequence divergence within-group, as is also seen with bootstrap support and Bayesian posterior probability in MSA-based studies.
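A delete-40% jackknife of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Chapter 4: the block size of 100 residues, the function names and the representation of a tree as a set of clades are all hypothetical, and the tree-inference step between resampling and support calculation is omitted.

```python
import random

def jackknife_fragments(sequence, block=100, delete_fraction=0.4, rng=random):
    """Split a sequence into blocks and randomly drop a fraction of them."""
    blocks = [sequence[i:i + block] for i in range(0, len(sequence), block)]
    n_keep = round(len(blocks) * (1 - delete_fraction))
    kept = sorted(rng.sample(range(len(blocks)), n_keep))
    # Blocks stay separate so that no k-mer spans an artificial junction.
    return [blocks[i] for i in kept]

def node_support(reference_clades, replicate_clades):
    """Fraction of replicates in which each reference clade is recovered."""
    n = len(replicate_clades)
    return {clade: sum(clade in rep for rep in replicate_clades) / n
            for clade in reference_clades}

rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
replicate = jackknife_fragments(genome, rng=rng)

# Hypothetical clades recovered in two replicate trees.
ab, abc = frozenset("AB"), frozenset("ABC")
support = node_support([ab, abc], [{ab, abc}, {ab}])
```

In practice each replicate would be passed through the same AF distance and tree-building steps as the full data, and the support for a node would be the proportion of replicate trees containing the corresponding clade.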

The AF tree based on the 143-genome dataset, inferred using the $D_2^S$ method, recovered 13 out of the 15 backbone nodes found in a previous study7. Likewise, AF trees based on the 27-genome E. coli and Shigella dataset recovered the ECOR reference groups15. For the Yersinia dataset, all AF-based methods recovered only two types of topology, one of which reflects the extent of genome rearrangement among these strains. Given the intricate evolution of these taxa and the fact that I cannot recreate history, the true phylogeny in these instances remains an open question.

Among the AF methods I tested, methods based on word counts can be orders of magnitude faster than those based on match lengths, but tend to be more memory-intensive and more sensitive to parameter settings. However, it has been proposed16 that filtering out non-informative k-mers can be a useful approach to reducing memory requirements.

This study makes it clear that AF approaches present exciting alternatives in phylogenetic inference for large sets of microbial-sized genomes at different phyletic breadth, even in the presence of genomic rearrangement and LGT.

6.3 An extended AF approach to study microbial evolution

In Chapter 5, I demonstrated that AF approaches can be used to infer phylogenetic networks quickly and accurately for large-scale microbial whole-genome data. I introduced for the first time the concept of k-mer similarity network and two different types of AF networks, the I- and P-networks. I showed that by combining a k-mer approach with the use of a relational database, biological information can be accessed for large-scale data at unprecedented speed. I also defined the concept of core k-mers as k-mers present in every isolate genome of a genus, following the concept of core genes17,18.

I examined the impact of rRNA genes and plasmids on the phylogenetic signal captured when computing phylogenomic relationships among microbial genomes. I found that the rRNA genes contribute to the phylogenetic signal captured by 25-mers, and that the phylogenetic signal recovered in the whole-genome network is not driven by these genes alone. The density of the AF plasmid network confirmed the large diversity of these mobile genetic elements, and I found that the connections observed in this network are similar to those in the networks based on whole genomes, with or without rRNA genes, and on rRNA gene sequences. Overall, I demonstrated that the I- and P-networks provide a quick overview of the evolutionary relationships among whole genomes, or subsets of genomes, in large-scale datasets. Moreover, my findings suggest that these AF networks, based on pairwise comparison of 25-mers between isolates, can be used to study the evolutionary dynamics aggregated at different taxonomic levels.

My approach to finding the most highly conserved functions (apart from those of rRNAs) using core 25-mer profiles revealed that the biological functions related to Amino acid metabolism and transport, and Energy production and conversion, are the most conserved categories across the phyla tested. My results also supported previous findings showing that the ε-Proteobacteria are the most-basal Proteobacteria by most criteria19, and that the Tenericutes do not have highly conserved functions related to energy production and conversion, consistent with their parasitic lifestyle20.

6.4 Conclusion and future directions

In just a few years, AF approaches have demonstrated strong performance in phylogenomic analysis. These approaches can generate accurate phylogenetic relationships among microbial organisms using DNA/protein sequences and whole-genome data, at unprecedented speed. By comparing only k-mer profiles between sequences, the AF methods are more scalable than their MSA-based counterparts, without the computational overhead of a complex substitution model (or multiple such models for different sequence regions). For most AF methods, the k-mer length is the only parameter that needs to be selected.

In this study I demonstrated that a jackknife approach can provide biologically meaningful node-support values for AF trees; however, the unit of branch length remains to be determined. In most phylogenetic trees inferred using MSA-based approaches, branch length represents the number of substitutions per site, whereas for AF trees the unit of branch length has yet to be defined.

This work has demonstrated that most available AF methods provide similar accuracy in most biological scenarios tested, assuming optimal parameter values are used. However, their scalability varies significantly. Word-count methods perform better than their match-length counterparts in terms of accuracy and scalability, and tend to be a better choice for phylogenomic analysis. Regarding the word-count methods, I found that k-mer length is optimal when it allows for sufficient k-mer uniqueness, which in turn depends on three major components: length of the sequences, alphabet size (4 for DNA and 20 for amino acids) and divergence of the data. The choice of k can significantly change the phylogenies inferred from short sequences, but has less impact when using whole genomes. A k-mer length between 22 and 25 often provides optimal k-mer uniqueness and leads to accurate phylogenies using whole-genome data.
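The dependence of k-mer uniqueness on word length can be illustrated with a small Python sketch. Here uniqueness is taken, as an assumption for illustration only, to be the fraction of distinct k-mers that occur exactly once in a sequence; the function name `unique_fraction` and the random test sequence are hypothetical and do not reproduce the measure or data used in this thesis.

```python
import random
from collections import Counter

def unique_fraction(seq, k):
    """Fraction of distinct k-mers that occur exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Hypothetical random DNA sequence of 2000 nt (alphabet size 4).
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(2000))
for k in (2, 4, 8, 12):
    print(k, round(unique_fraction(seq, k), 3))
```

For a sequence of length L there are only L - k + 1 word positions but 4^k possible DNA words, so once 4^k greatly exceeds L almost every observed word is unique; longer sequences and smaller alphabets push that point towards larger k.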

My findings suggest that a k-mer length of 25 can be used as a golden parameter, and word-count methods as standard methods, for AF phylogenomic analysis of whole-genome data. Moreover, instead of computing a phylogenetic tree or network for each dataset or problem, it could be useful to create an open-access database in which all AF pairwise comparisons between organisms, and extensive biological information related to the k-mers, would be stored and indexed.

Finally, I have shown that an extended AF approach can be used to infer phylogenetic networks of whole-genome data using k-mers as the basic unit. However, the nature of the phylogenetic signal (e.g. vertical or lateral) captured by the k-mers remains unknown. It would be of great interest to be able to discriminate the edges in the AF networks based on the dominant phylogenetic signal observed (e.g. vertical versus lateral). A new type of network, allowing two types of edges representing vertical and lateral inheritance signal, could be the next step for AF phylogenomics.

Overall, in this thesis, I have provided a comprehensive review of the different AF approaches available in the context of phylogenomic analysis, systematically assessed their robustness and scalability, and demonstrated their performance in phylogenomic inference of microbial genomes. An extended AF approach could lead to the era of “Next-Generation Phylogenomics”21.

6.5 References

1 Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, doi:10.1186/1471-2148-9-195 (2009).

2 Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336-350, doi:10.1089/cmb.2006.13.336 (2006).

3 Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406-425 (1987).

4 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

5 Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106, 2677-2682, doi:10.1073/pnas.0813249106 (2009).

6 Reinert, G., Chew, D., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16, 1615-1634, doi:10.1089/cmb.2009.0198 (2009).

7 Beiko, R. G., Harlow, T. J. & Ragan, M. A. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102, 14332-14337, doi:10.1073/pnas.0504068102 (2005).

8 Darling, A. E., Miklós, I. & Ragan, M. A. Dynamics of genome rearrangement in bacterial populations. PLoS Genet 4, e1000128, doi:10.1371/journal.pgen.1000128 (2008).

9 Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467-1490, doi:10.1089/cmb.2010.0056 (2010).

10 Chan, C. X., Darling, A. E., Beiko, R. G. & Ragan, M. A. Are protein domains modules of lateral genetic transfer? PLoS ONE 4, e4524, doi:10.1371/journal.pone.0004524 (2009).

11 Burden, C. J., Kantorovitz, M. R. & Wilson, S. R. Approximate word matches between two random sequences. Ann Appl Probab 18, 1-21, doi:10.1214/07-AAP452 (2008).

12 Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4, 6504, doi:10.1038/srep06504 (2014).

13 Farris, J. S., Albert, V. A., Källersjö, M., Lipscomb, D. & Kluge, A. G. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12, 99-124 (1996).

14 Shi, J., Zhang, Y., Luo, H. & Tang, J. Using jackknife to assess the quality of gene order phylogenies. BMC Bioinformatics 11, 168, doi:10.1186/1471-2105-11-168 (2010).

15 Skippington, E. & Ragan, M. A. Within-species lateral genetic transfer and the evolution of transcriptional regulation in Escherichia coli and Shigella. BMC Genomics 12, 532, doi:10.1186/1471-2164-12-532 (2011).

16 Gunasinghe, U., Alahakoon, D. & Bedingfield, S. Extraction of high quality k-words for alignment-free sequence comparison. J Theor Biol 358, 31-51, doi:10.1016/j.jtbi.2014.05.016 (2014).

17 Lerat, E., Daubin, V. & Moran, N. A. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 1, E19, doi:10.1371/journal.pbio.0000019 (2003).

18 Daubin, V., Gouy, M. & Perriere, G. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 12, 1080-1090, doi:10.1101/gr.187002 (2002).

19 Trust, T. J. et al. Phylogenetic and molecular characterization of a 23S rRNA gene positions the genus Campylobacter in the epsilon subdivision of the Proteobacteria and shows that the presence of transcribed spacers is common in Campylobacter spp. J Bacteriol 176, 4597-4609 (1994).

20 Skennerton, C. T. et al. Phylogenomic analysis of Candidatus 'Izimaplasma' species: free-living representatives from a Tenericutes clade found in methane seeps. ISME J 10, 2679-2692, doi:10.1038/ismej.2016.55 (2016).

21 Chan, C. X. & Ragan, M. A. Next-generation phylogenomics. Biol Direct 8, 3, doi:10.1186/1745-6150-8-3 (2013).

APPENDIX A

Inferring phylogenies of evolving sequences without multiple sequence alignment

Supplementary Information

Description of statistics used in this study

The $D_2$ statistic1 is defined as the count of exact word (w) matches of length k shared between two sequences. Given alphabet $A$, there are $A^k$ possible words of length k. Given two sequences, X and Y, the $D_2$ score is defined as:

$$D_2 = \sum_{w \in A^k} X_w Y_w$$

in which $X_w$ is the count of word w in sequence X, and $Y_w$ is the count of word w in sequence Y.
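The basic $D_2$ score can be computed directly from two k-mer count tables. The following Python sketch is one straightforward way to do this; the function names `kmer_counts` and `d2` and the toy sequences are hypothetical and are not taken from the software used in the study.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y, k):
    """D2 score: sum over shared words w of X_w * Y_w."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Words absent from either sequence contribute zero, so only
    # the shared words need to be visited.
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())

# ACG and CGT occur twice in the first sequence and once in the second.
print(d2("ACGTACGT", "ACGTTT", 3))  # prints 4
```

Because only words observed in both sequences are visited, the cost of the score is bounded by the number of distinct k-mers actually present rather than by the full set of possible words.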

Other extensions of this statistic have been developed based on different ways of normalisation. In this study we used $D_2^S$ and $D_2^*$, and we introduce $D_2^n$.

$D_2^S$ is based on Shepp’s statistic2, in which a $D_2$ score is normalised based on the probability of occurrence of each specific k-mer in the sequence3,4. Given n and m, the number of all possible w (k-mers) in sequences X and Y respectively, and $p_w^X$ and $p_w^Y$, the probability of a specific w (k-mer) in sequences X and Y respectively, $X_w$ and $Y_w$ can be normalised as:

$$\tilde{X}_w = X_w - n p_w^X \quad \text{and} \quad \tilde{Y}_w = Y_w - m p_w^Y$$

$D_2^S$ is then defined as:

$$D_2^S = \sum_{w \in A^k} \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}}$$

Similarly, $D_2^*$ is based on the postulation that the number of occurrences of word w (k-mer) is approximately Poisson, so that its mean and variance are approximately the same for long words w3,4:

$$D_2^* = \sum_{w \in A^k} \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{n m \, p_w^X p_w^Y}}$$

For a detailed description and review of $D_2$ statistics, see Song et al.5
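The two normalised statistics can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the study: word probabilities are estimated under an independent-letter model from each sequence's own letter frequencies, n and m are taken as the number of word positions in each sequence, the denominators are taken to be $\sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}$ and $\sqrt{n m \, p_w^X p_w^Y}$, and words with a zero denominator are skipped. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def word_prob(word, freqs):
    """Word probability under an independent-letter model (assumption)."""
    return math.prod(freqs.get(ch, 0.0) for ch in word)

def normalised_d2(x, y, k, alphabet="ACGT"):
    """Return (D2S, D2*) for sequences x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    n, m = len(x) - k + 1, len(y) - k + 1
    fx = {ch: c / len(x) for ch, c in Counter(x).items()}
    fy = {ch: c / len(y) for ch, c in Counter(y).items()}
    d2s = d2star = 0.0
    # Unlike plain D2, unobserved words still contribute after centring,
    # so the sum runs over every possible word of length k.
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        px, py = word_prob(w, fx), word_prob(w, fy)
        xt, yt = cx[w] - n * px, cy[w] - m * py  # centred counts
        norm_s = math.sqrt(xt * xt + yt * yt)
        if norm_s > 0:
            d2s += xt * yt / norm_s
        norm_star = math.sqrt(n * m * px * py)
        if norm_star > 0:
            d2star += xt * yt / norm_star
    return d2s, d2star

print(normalised_d2("ACGTACGTAC", "ACGTTTGCAC", 2))
```

Enumerating all words is only practical for small k; for longer words a background model and a sparse representation of the centred counts would be needed.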

In this study, we also introduce $D_2^n$, a $D_2$ statistic that expands each word w recovered in the sequences to its neighbourhood n, i.e. all possible k-mers with n wildcard residue(s) relative to w. For example, consider n = 1, k = 3 in a nucleotide sequence. Given w = AAA recovered from sequence X, the neighbourhood for AAA includes AAA, AAC, AAT, AAG, ACA, ATA, AGA, CAA, TAA, GAA; all of these are recorded as k-mers found in sequence X. Similarly, all k-mers within the neighbourhood for each w found in sequence X are recorded, likewise for those found in sequence Y. $D_2^n$ is a simple extension of the original $D_2$ metric. Let $X_w^n$ and $Y_w^n$ be the number of occurrences of word w in the n-neighbourhood for sequences X and Y respectively; $D_2^n$ is then defined as:

$$D_2^n = \sum_{w \in A^k} X_w^n Y_w^n$$
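The neighbourhood expansion can be sketched in Python as below. The sketch assumes that the n-neighbourhood of a word is the set of all words differing from it at no more than n positions, which matches the AAA example above; the function names are hypothetical and the code favours clarity over speed.

```python
from collections import Counter
from itertools import combinations, product

def neighbourhood(word, n, alphabet="ACGT"):
    """All words differing from `word` at no more than n positions."""
    result = {word}
    for positions in combinations(range(len(word)), n):
        for letters in product(alphabet, repeat=n):
            chars = list(word)
            for pos, ch in zip(positions, letters):
                chars[pos] = ch
            result.add("".join(chars))
    return result

def neighbourhood_counts(seq, k, n, alphabet="ACGT"):
    """Count each word once for every k-mer occurrence whose neighbourhood contains it."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        for w in neighbourhood(seq[i:i + k], n, alphabet):
            counts[w] += 1
    return counts

def d2_neighbourhood(x, y, k, n=1, alphabet="ACGT"):
    """D2 computed on neighbourhood-expanded counts."""
    cx = neighbourhood_counts(x, k, n, alphabet)
    cy = neighbourhood_counts(y, k, n, alphabet)
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())

print(sorted(neighbourhood("AAA", 1)))
print(d2_neighbourhood("AAA", "AAC", 3, n=1))  # prints 4
```

In the usage line, AAA and AAC share no exact 3-mer, so the plain $D_2$ score is 0, whereas their neighbourhoods share AAA, AAC, AAG and AAT.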

Computational complexity of $D_2$ statistics

The computation time of $D_2$ largely scales linearly with the number of sequences (N) and average sequence length (L) during k-mer retrieval and quadratically with N during the score computation, compared to quadratic scaling with N for $D_2^*$ and $D_2^S$ (ref. 6). Our $D_2^n$ has a complexity of $O(N(L + A^k n^2) + N^2 A^k)$, where A is alphabet size (i.e. A = 4 and 20 respectively for nucleotides and proteins), k is word (k-mer) length and n is neighbourhood, sharing a similar complexity with the published method N2 (ref. 6). This is in comparison to generally quadratic and cubic complexity for MSA and phylogenetic distance methods on trees7,8.

Analysis of among-site rate heterogeneity

We assessed the sensitivity of $D_2$ methods to among-site rate heterogeneity in the sequences, by varying the α value of the gamma distribution under which we simulated the sequence sets9. A small α value (≤ 1) implies high heterogeneity of substitution rates across sites, i.e. some sites have diversified quickly while others have changed little or not at all, analogous to having multiple conserved domains within a protein sequence, or the corresponding regions (domons10) within a nucleotide sequence. On the other hand, a large α (> 1) indicates low among-site rate heterogeneity, i.e. most sites have diversified at about the same rate.

We simulated sequence sets under T2 (Fig. 1) across different α values in the 8-category discrete gamma distribution, either with a uniform α value set across all tree branches, or with a mixture of two α values set for each half of the tree branches. The latter cases simulate a combination of both high- and low-level rate heterogeneity within a set of sequences. We progressively set α = 0.5, 1.0 and 2.0 under the same 8-category discrete gamma distribution9. For sets containing sequences of varied rate heterogeneity, each half of the sequences was first simulated under a single α value of gamma distribution before being combined together into a single set.
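The effect of α on rate heterogeneity can be illustrated with a short Python sketch. It draws continuous per-site rates from a gamma distribution with mean 1 (shape α, scale 1/α), whose variance is 1/α; this is a simplification of the 8-category discrete gamma model used in the simulations, and the function name `gamma_rates` is hypothetical.

```python
import random
import statistics

def gamma_rates(alpha, n_sites, rng):
    """Draw per-site relative rates from a gamma distribution with mean 1."""
    # shape = alpha and scale = 1/alpha give mean 1 and variance 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

rng = random.Random(42)
for alpha in (0.5, 1.0, 2.0):
    rates = gamma_rates(alpha, 20000, rng)
    print(alpha,
          round(statistics.mean(rates), 2),
          round(statistics.variance(rates), 2))
```

Smaller α values give a larger variance of rates across sites, which is the high-heterogeneity regime described above.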

Supplementary Fig. S6 shows the mean Q for each $D_2$ method for cases with (i) a uniform α = 0.01, (ii) α = 0.5 and 1.5 each for one-half of the sequences, and (iii) a uniform α = 5.0, for each size category N of sequence set, at the optimal k = 8 and 4 respectively for nucleotides and proteins. Complete results across all categories and k-mer lengths are shown in Supplementary Fig. S7 and S8 respectively for nucleotide and protein sequence sets. For nucleotide sequence sets simulated at low (uniform) α = 0.01, $D_2$ methods at k = 8 on average performed better (mean Q > 0) than the standard approach, particularly at N = 32 and 128 (e.g. QD2n1 = 0.007, 0.082 and 0.092 for N = 8, 32 and 128). In the other cases, both approaches yielded almost identical trees across all N categories (all Q = 0 except in the cases of mixed α = 0.1 or 1.5 at N = 128, in which Q = -0.00008 for each $D_2$ method). Similar observations are noted for the protein sequence sets across the different levels of among-site rate heterogeneity, e.g. mean QD2n1 = 0.052, 0.096 and 0.108 for N = 8, 32 and 128 in cases of uniform α = 0.1; Q approximates 0 in the other cases. Interestingly $D_2^*$, in contrast to other $D_2$ methods, appears to perform consistently worse than MSA across all cases of (i), (ii) and (iii) with observed negative Q values, e.g. at N = 128, mean QD2* = -0.042, -0.007 and -0.014 for cases in (i), (ii) and (iii). In general, among-site rate variation does not appear to affect drastically the accuracy of either $D_2$ or MSA-based approaches (Q = 0 in most cases in Supplementary Fig. S6). In fact, $D_2$ methods at the appropriate k-mer length largely perform as well as, if not better than, the MSA standard with this dataset (Q ≥ 0 in Supplementary Fig. S6).

Analysis of compositional biases

For this part of the analysis, we simulated sequence sets across different levels of G+C content. For protein sequence data, nucleotide sequence sets were first generated using the specific G+C proportion and subsequently translated into amino acids using the standard genetic code. We set the G+C proportion progressively at 0.5, 0.7 and 0.9, with fixed α = 1.0. As in an earlier study11, we considered only cases of G+C proportion ≥ 0.5 because the effects of nucleotide content bias on trees are symmetrical around G+C 0.5, e.g. G+C 0.8 and G+C 0.2 show identical effects in our analyses; the corresponding biases at the amino-acid level are symmetric enough for our purposes here. For protein sequence sets, the nucleotide sequences were translated in frame +1. Where stop codons (TAG, TAA and TGA) were present, the thymine residue was arbitrarily replaced with adenine to avoid interruption of protein translation. All sequences were simulated under tree T2 (Fig. 1).
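The translation step with stop-codon replacement can be sketched in Python as follows. The sketch generates an independent random nucleotide sequence at a given G+C proportion purely for illustration; it does not evolve sequences along tree T2 as in the study. The names `random_dna` and `translate_frame1` are hypothetical, and the codon table is the standard genetic code written in the usual TCAG ordering.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order at each position.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def random_dna(length, gc, rng):
    """Random sequence with expected G+C proportion gc (G = C, A = T)."""
    weights = [(1 - gc) / 2, gc / 2, (1 - gc) / 2, gc / 2]  # T, C, A, G
    return "".join(rng.choices(BASES, weights=weights, k=length))

def translate_frame1(dna):
    """Translate in frame +1, replacing the T of any stop codon with A."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if CODON_TABLE[codon] == "*":
            codon = "A" + codon[1:]  # TAG->AAG, TAA->AAA, TGA->AGA
        protein.append(CODON_TABLE[codon])
    return "".join(protein)

rng = random.Random(0)
dna = random_dna(300, 0.9, rng)
print(translate_frame1(dna))
```

All three stop codons begin with thymine, so replacing that residue with adenine always yields a sense codon (lysine or arginine).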

Supplementary Fig. S9 shows the mean RF for each $D_2$ method for cases with G+C proportion (i) 0.5, (ii) 0.7 and (iii) 0.9, for each size category N of sequence sets (at k = 8 and 4 respectively for nucleotides and proteins). For both nucleotide and protein sequence sets in all size categories, both RFD2 and RFD2S are < 0.003 across different G+C proportions. Similarly, RFD2n1 values across these data are largely zero except for cases at the extreme G+C proportion 0.9, e.g. RFD2n1 = 0.021 and 0.024 respectively for nucleotide and protein sets of N = 128. The high G+C proportion (thus low complexity of sequences) plays to the strength of local exact matches, rather than neighbourhood (non-exact) matches as allowed in $D_2^{n=1}$. At G+C proportion 0.50 (no G+C bias), all methods perfectly recovered the reference topology across all N for nucleotide sequence sets. Interestingly, $D_2^*$ appears to be more sensitive (and to perform drastically worse) than the other $D_2$ methods as G+C proportion and size of sequence sets increase, e.g. RFD2* = 0.12 and 0.19 respectively for protein sequence sets of N = 32 and 128 that are generated at nucleotide G+C proportion 0.9.

Analysis of other scenarios of insertions/deletions

To assess the sensitivity of the alignment-free approach specifically to insertions/deletions, we imposed post-hoc deletions within a set, in half of the sequences such that each sequence has three regions of length $l$ deleted at random but non-overlapping positions. Supplementary Fig. S12 shows the Q value obtained using each $D_2$ method across cases at varied length $l$, respectively for nucleotide (Fig. S12a) and protein (Fig. S12b) sequence sets. Our results suggest that alignment-free methods are more sensitive to deletion than the MSA-based approach, with observed Q values decreasing with increasing extent of deletion. For instance, QD2n1 = -0.016, -0.030 and -0.235 at $l$ = 10nt, 50nt and 150nt. At $l$ = 150nt, the total deleted region in a nucleotide sequence is 450nt (30% of the full length). Deletion reduces the number of k-mers available for inference of evolutionary relationships, so this result is perhaps not surprising. For the protein sequence sets the negative impact of deletion is greater, with QD2n1 = -0.053, -0.090 and -0.602 at $l$ = 3aa, 15aa and 50aa. Interestingly $D_2^*$, normalised based on the probability of occurrences of each distinct k-mer, again fared worse than the other $D_2$ methods across the protein sequence sets, with observed Q = -0.650 at $l$ = 50aa (total deletion of 150aa; 30% of full length). These findings suggest that the MSA-based approach is more robust to unstructured insertions/deletions than is the alignment-free approach. The choice of MUSCLE or MAFFT makes no significant difference to our results (p > 0.5 observed for both nucleotide and protein datasets in a two-sided paired Student’s t-test; Supplementary Table S6).
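The post-hoc deletion scheme can be sketched in Python as follows. This is one possible way to place three non-overlapping deletions at random positions, not necessarily the procedure used in the study; the function names `delete_regions` and `impose_deletions` and the toy sequences are hypothetical.

```python
import random

def delete_regions(seq, length, n_regions=3, rng=random):
    """Delete n_regions non-overlapping regions of the given length."""
    free = len(seq) - n_regions * length
    if free < 0:
        raise ValueError("sequence too short for the requested deletions")
    # Sorted offsets in the retained part, shifted by the regions already
    # placed, guarantee that starts are at least `length` apart.
    offsets = sorted(rng.randint(0, free) for _ in range(n_regions))
    starts = [off + i * length for i, off in enumerate(offsets)]
    kept, prev = [], 0
    for s in starts:
        kept.append(seq[prev:s])
        prev = s + length
    kept.append(seq[prev:])
    return "".join(kept)

def impose_deletions(seqs, length, rng):
    """Apply the deletions to a randomly chosen half of the sequences."""
    chosen = set(rng.sample(range(len(seqs)), len(seqs) // 2))
    return [delete_regions(s, length, rng=rng) if i in chosen else s
            for i, s in enumerate(seqs)]

rng = random.Random(7)
seqs = ["".join(rng.choice("ACGT") for _ in range(1500)) for _ in range(8)]
print([len(s) for s in impose_deletions(seqs, 150, rng)])
```

With 1500 nt sequences and a region length of 150 nt, each affected sequence loses 450 nt, i.e. 30% of its length, as in the example given above.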

At the extreme end of the spectrum within the MSA context, staggering of deletions across a sequence set, although biologically unrealistic, is recognised as a worst-case scenario for phylogenetic inference12. Even short deletions arrayed in this way cause gaps to be placed incorrectly in the alignment. We examined this case setting $l$ = 150nt or 50aa, and displacement length x = 30nt or 15aa (see Methods) to assess the accuracy of $D_2$ methods relative to the MSA-based approach, at both the least-divergent and most-divergent sequence sets (T1 and T4 respectively in Fig. 1). For nucleotide sequences, both approaches recovered the reference topologies perfectly (Q = 0) in all cases. Interestingly, for protein sequences the $D_2$ methods performed better than the MSA-based approach in recovering the reference topologies (Q > 0 across all $D_2$ methods), particularly when sequences are highly similar (T1). As shown in Fig. S10c, QD2n1 = 0.37 and 0.06 respectively over T1 and T4. By contrast, deletions arrayed in this way negatively impact MSA, leading to inaccurate phylogenetic inference using MrBayes. Interestingly, smaller Q values are observed when MAFFT was used as the MSA tool instead of MUSCLE, e.g. QD2n1 = 0.37 and 0.06 respectively over T1 and T4. For these vertically staggered deletions, we set the displacement length x to be shorter than the length $l$ of each deleted region, such that the deleted regions overlap to some extent with one another. Where x = $l$ (no deleted region is overlapped), both the alignment-free and MSA-based approaches recovered the same topology (Q = 0). This shows that overlapping deletions in protein sequences negatively impact MSA but not $D_2$ methods.

Analysis of truncated or incomplete sequence data

These observations raise the question of how sensitive k-mer methods are to truncated or incomplete sequence data. We observed an increase of RFD2n1 with increasing proportion of truncated sequences within a sequence set, and note that a larger k could be useful in these cases. For instance, the smallest mean RFD2n1 is observed at k = 16 (0.214), k = 20 (0.423) and k = 24 (0.578) for cases in which the proportion of truncated sequences is set at 0.25, 0.50 and 0.75 respectively (Supplementary Fig. S13). In an independent analysis, we observed that the accuracy of $D_2$ methods increases with increasing sequence length (RFD2n1 > 0.3 at L = 250nt, RFD2n1 = 0 across all k at L ≥ 10000nt). This suggests that where sets include sequences of different lengths, the degree to which the accuracy of $D_2$ methods is degraded probably depends on the magnitude of differences among these lengths, and that the use of longer k-mers in such cases could be beneficial. These issues remain to be systematically investigated.

Comparison of MUSCLE versus MAFFT approaches

For each scenario, to assess the difference between the observed Q values from MUSCLE+MrBayes and from MAFFT+MrBayes we performed a two-sided paired Student’s t-test. A difference is considered statistically significant if p ≤ 0.05; this significance simply reflects the difference in performance between the two MSA tools, not their performance relative to the alignment-free approach.
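The paired comparison can be sketched in Python as follows. The sketch computes only the paired t statistic and its degrees of freedom using the standard library; the two-sided p-value would then be read from a Student's t distribution with n - 1 degrees of freedom, for example with `scipy.stats.ttest_rel` if SciPy is available. The function name `paired_t_statistic` and the Q values are hypothetical, not results from this study.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical Q values for the same scenarios under the two MSA tools.
q_muscle = [0.00, -0.02, 0.05, 0.10, -0.01]
q_mafft = [0.01, -0.03, 0.04, 0.08, 0.00]
t, df = paired_t_statistic(q_muscle, q_mafft)
print(round(t, 3), df)
```

Pairing by scenario removes the between-scenario variation, so the test addresses only the difference attributable to the choice of MSA tool.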

References

1. Torney, D.C., Burks, C., Davison, D. & Sirotkin, K.M. in Computers and DNA - Santa Fe Institute Studies in the Sciences of Complexity Vol 7. (eds. G. Bell & R. Marr) 109-125 (Addison-Wesley, Reading, MA; 1990).

2. Shepp, L. Normal functions of normal random variables. SIAM Review 6, 459-460 (1964).

3. Wan, L., Reinert, G., Sun, F. & Waterman, M.S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput. Biol. 17, 1467-1490 (2010).

4. Reinert, G., Chew, D., Sun, F. & Waterman, M.S. Alignment-free sequence comparison (I): statistics and power. J Comput. Biol. 16, 1615-1634 (2009).

5. Song, K. et al. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief. Bioinform. 15, 343-353 (2014).

6. Göke, J., Schulz, M.H., Lasserre, J. & Vingron, M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics 28, 656-663 (2012).

7. Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792-1797 (2004).

8. Ronquist, F. et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61, 539-542 (2012).

9. Yang, Z. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367-372 (1996).

10. Chan, C.X., Darling, A.E., Beiko, R.G. & Ragan, M.A. Are protein domains modules of lateral genetic transfer? PLoS ONE 4, e4524 (2009).

11. Chan, C.X., Mahbob, M. & Ragan, M.A. Clustering evolving proteins into homologous families. BMC Bioinformatics 14, 120 (2013).

12. Golubchik, T., Wise, M.J., Easteal, S. & Jermiin, L.S. Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Mol. Biol. Evol. 24, 2433-2442 (2007).

Supplementary Figures and Tables

[Figure S1 panels: a. $D_2$, b. $D_2^S$, c. $D_2^*$, d. $D_2^{n=1}$; each with sub-panels (i) N = 8, (ii) N = 32, (iii) N = 128. x-axis: Tree (T1 to T6); y-axis: RF; legend: k = 4, 8, 12, 16, 20, 24.]

Supplementary Figure S1. RF for each $D_2$ method on nucleotide sequence sets across different trees. For each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and at sequence set size N = 8, 32 and 128, mean RF is shown across different k-mer lengths, for each tree T1 through T6. Error bars indicate standard deviation from the mean.

[Figure S2 panels: a. $D_2$, b. $D_2^S$, c. $D_2^*$, d. $D_2^{n=1}$; each with sub-panels (i) N = 8, (ii) N = 32, (iii) N = 128. x-axis: Tree (T1 to T6); y-axis: RF; legend: k = 4, 8, 12, 16, 20, 24.]

Supplementary Figure S2. RF for each $D_2$ method on protein sequence sets across different tree topologies. For each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and at sequence set size N = 8, 32 and 128, mean RF is shown across different k-mer lengths, for each tree T1 through T6. Error bars indicate standard deviation from the mean.

[Figure S3 panels: a. $D_2$, b. $D_2^S$, c. $D_2^*$, d. $D_2^{n=1}$; each with sub-panels (i) N = 8, (ii) N = 32, (iii) N = 128. x-axis: Tree (T1 to T6); y-axis: Q value; legend: k = 4, 8, 12, 16, 20, 24.]

Supplementary Figure S3. Average Q value for each $D_2$ method on nucleotide sequence sets across different trees. For each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and at sequence set size N = 8, 32 and 128, mean Q is shown across different k-mer lengths, for each tree T1 through T6. Error bars indicate standard deviation from the mean.

[Figure S4 panels: a. $D_2$, b. $D_2^S$, c. $D_2^*$, d. $D_2^{n=1}$; each with sub-panels (i) N = 8, (ii) N = 32, (iii) N = 128. x-axis: Tree (T1 to T6); y-axis: Q value; legend: k = 4, 8, 12, 16, 20, 24.]

Supplementary Figure S4. Average Q value for each $D_2$ method on protein sequence sets across different trees. For each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and at sequence set size N = 8, 32 and 128, mean Q is shown across different k-mer lengths, for each tree T1 through T6. Error bars indicate standard deviation from the mean.

[Figure S5 panels: a. nucleotide, b. protein; each with sub-panels N = 8, N = 32, N = 128. x-axis: k (4, 8, 12, 16, 20, 24); y-axis: RF; legend: $D_2$, $D_2^S$, $D_2^*$, $D_2^{n=1}$.]

Supplementary Figure S5. The accuracy of $D_2$ methods based on k-mer lengths. For each of (a) nucleotide and (b) protein sequence sets, RF is shown for each method of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, across different k-mer lengths (indicated on the x-axis), for each size N of 8, 32 and 128. Error bars indicate standard deviation from the mean.

[Figure S6 panels: a. nucleotide, b. protein; each with sub-panels (i) α = 0.01, (ii) α = 0.5 / 1.5, (iii) α = 5.0. x-axis: N (8, 32, 128); y-axis: Q value; legend: $D_2$, $D_2^S$, $D_2^*$, $D_2^{n=1}$.]

Supplementary Figure S6. The accuracy of $D_2$ methods based on among-site rate heterogeneity. For each of (a) nucleotide and (b) protein sequence sets, Q value is shown for each method of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, across different k-mer lengths (indicated on the x-axis) for each size N of 8, 32 and 128, for each dataset with (i) a uniform α = 0.1, (ii) mixed α of 0.5 and 1.5, and (iii) a uniform α = 5.0. Error bars indicate standard deviation from the mean.

[Figure S7 panels: a. α = 0.01, b. α = 0.75 / 1.25, c. α = 0.50 / 1.50, d. α = 0.10 / 2.10, e. α = 5.00; each with sub-panels N = 8, N = 32, N = 128. x-axis: k (4, 8, 12, 16, 20, 24); y-axis: Q value; legend: $D_2$, $D_2^S$, $D_2^*$, $D_2^{n=1}$.]

Supplementary Figure S7. Average Q value for each $D_2$ method on nucleotide sequence sets across different categories of among-site rate heterogeneity. For each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and at sequence set size N = 8, 32 and 128, mean Q is shown across different k-mer lengths, for each category of α at 0.01, 0.75/1.25, 0.5/1.5, 0.1/2.1 and 5.0. Error bars indicate standard deviation from the mean.


[Figure S8 panels: a. α = 0.01, b. α = 0.75 / 1.25, c. α = 0.50 / 1.50, d. α = 0.10 / 2.10, e. α = 5.00; each with sub-panels N = 8, N = 32, N = 128. x-axis: k (4, 8, 12, 16, 20, 24); y-axis: Q value; legend: $D_2$, $D_2^S$, $D_2^*$, $D_2^{n=1}$.]

Supplementary Figure S8. Average Q value for each $D_2$ method on protein sequence sets across different categories of among-site rate heterogeneity. For each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, and at sequence set size N = 8, 32 and 128, mean Q is shown across different k-mer lengths, for each category of α at 0.01, 0.75/1.25, 0.5/1.5, 0.1/2.1 and 5.0. Error bars indicate standard deviation from the mean.

[Figure: panels (a) nucleotide and (b) protein sequence sets. Each panel plots RF (y-axis) against N = 8, 32 and 128 (x-axis) for G+C = 0.5, 0.7 and 0.9, with one series for each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$.]

Supplementary Figure S9. The accuracy of D2 methods based on compositional biases. For each of (a) nucleotide and (b) protein sequence sets, RF is shown for each method of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$, at optimal k (k = 8 for nucleotides, 4 for proteins) across different sizes N at 8, 32 and 128, for each dataset with a G+C proportion at 0.5, 0.7 and 0.9. Error bars indicate standard deviation from the mean.

[Figure: (a) RF and (b) Q value at R = 10%, 25% and 50%, with one series for each of k = 8, 12, 16, 20 and 24; panel (a) also shows the MSA-based approach.]

Supplementary Figure S10. The accuracy of D2 methods based on genetic rearrangement compared to the MAFFT-based approach. $RF_{D_2^{n=1}}$ values are shown in (a) across different k-mer lengths (k ≥ 8), as well as that of the standard approach ($RF_{MSA}$), across different R at 10%, 25% and 50%. The corresponding $Q_{D_2^{n=1}}$ values are shown in (b). Error bars indicate standard deviation from the mean.

[Figure: (a) RF against indel rate (0.1 to 0.5) for RAxML, MrBayes and $D_2^S$ (k = 8); (b) Q against indel rate for $D_2^S$ versus MrBayes and $D_2^S$ versus RAxML.]

Supplementary Figure S11. The accuracy of MAFFT-based approaches based on insertions/deletions. RF values are shown in (a) for $D_2^S$, MAFFT+MrBayes and MAFFT+RAxML across different indel rates r. The corresponding Q values for MAFFT+MrBayes and MAFFT+RAxML are shown in (b). Error bars indicate standard deviation from the mean.

[Figure: Q value for each of $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$ in (a) nucleotide sequences with deletions of 10nt, 50nt and 150nt, (b) protein sequences with deletions of 3aa, 15aa and 50aa, and (c, d) protein sequences simulated under trees T1 and T4.]

Supplementary Figure S12. The accuracy of D2 methods based on other instances of insertions/deletions. Q values are shown for $D_2$, $D_2^S$, $D_2^*$ and $D_2^{n=1}$ across different lengths l of deleted regions, for (a) nucleotide sequences (k = 8) at l = 10nt, 50nt and 150nt, and for (b) protein sequences (k = 4) at l = 3aa, 15aa and 50aa. The corresponding Q values in the analysis of vertically staggered deletions are shown for protein sequences, simulated independently under trees T1 and T4, using (c) MUSCLE and (d) MAFFT as MSA tool in phylogenetic analysis. Error bars indicate standard deviation from the mean.

[Figure: RF (y-axis) for k = 8, 12, 16, 20 and 24 at (a) z = 0.25, 0.50 and 0.75, and (b) different sequence lengths (x-axis: Sequence length (L)).]

Supplementary Figure S13. The accuracy of D2 methods based on sequence lengths and incomplete data. $RF_{D_2^{n=1}}$ across different k-mer lengths is shown across (a) the proportion of truncated sequences in nucleotide sequence sets (N = 128) and (b) at varied sequence lengths within the set. Error bars indicate standard deviation from the mean.

[Figure: (a) RF for $D_2^{n=1}$ (k = 10), MrBayes and RAxML, and (b) Q for $D_2^{n=1}$ versus MrBayes and $D_2^{n=1}$ versus RAxML, at effective population sizes of 1K, 10K, 100K, 250K, 500K and 1000K.]

Supplementary Figure S14. The accuracy of MAFFT approaches based on coalescent evolution of gene families. RF values are shown in (a) for $D_2^{n=1}$, MAFFT+MrBayes and MAFFT+RAxML across different effective population size $N_e$. The corresponding Q values for MAFFT+MrBayes and MAFFT+RAxML are shown in (b). Error bars indicate standard deviation from the mean.

[Figure: panels (a) MUSCLE and (b) MAFFT. Top: RF for $D_2^{n=1}$ (k = 10), MrBayes and RAxML; bottom: Q for $D_2^{n=1}$ versus MrBayes and $D_2^{n=1}$ versus RAxML, at effective population sizes of 1K, 10K, 100K, 250K, 500K and 1000K.]

Supplementary Figure S15. The accuracy of phylogenetic approaches based on coalescent evolution of gene families that violates the molecular clock. RF values are shown across different effective population size $N_e$, for (a) $D_2^{n=1}$, MUSCLE+MrBayes and MUSCLE+RAxML, and for (b) $D_2^{n=1}$, MAFFT+MrBayes and MAFFT+RAxML. The corresponding Q values for each comparison of MSA-based approach against $D_2^{n=1}$ are shown at the bottom panels. Error bars indicate standard deviation from the mean.

[Figure: histograms of Frequency (y-axis) against (a) Size of sequence set (N) and (b) Within-set sequence similarity (% identity).]

Supplementary Figure S16. Distribution of (a) sequence set sizes and (b) within-set sequence similarity for the 2471-tree TreeBASE dataset.

[Figure: RF (y-axis) against Neighbourhood (n) = 1 to 5 (x-axis), in two panels for N = 8 and N = 128.]

Supplementary Figure S17. The accuracy of $D_2^n$ based on neighbourhood. $RF_{D_2^n}$ across different neighbourhood n for nucleotide sequence sets of size N = 8 and 128, simulated under tree T1. Error bars indicate standard deviation from the mean.

Supplementary Table S1. Observed RF for all approaches on the sequence sets simulated under coalescent model and average branch length of corresponding trees (unit in number of substitutions per site).

Effective population size (Ne)   $D_2^{n=1}$ RF (mean ± s.d.)   MUSCLE+MrBayes RF (mean ± s.d.)   MUSCLE+RAxML RF (mean ± s.d.)   Average branch length of a tree (mean ± s.d.)
1000                             0.2397 ± 0.0910                0.1617 ± 0.0454                   0.2134 ± 0.0883                 0.0026 ± 0.0008
10000                            0.0717 ± 0.0487                0.0545 ± 0.0364                   0.0648 ± 0.0522                 0.0277 ± 0.0099
100000                           0.0728 ± 0.0516                0.0674 ± 0.0473                   0.0631 ± 0.0555                 0.2382 ± 0.0789
250000                           0.1193 ± 0.0627                0.0771 ± 0.0537                   0.1031 ± 0.0657                 0.7031 ± 0.2353
500000                           0.2390 ± 0.0826                0.0912 ± 0.0487                   0.0931 ± 0.0593                 1.2518 ± 0.3467
1000000                          0.4072 ± 0.0836                0.1231 ± 0.0647                   0.1283 ± 0.0764                 2.8509 ± 0.9246

Supplementary Table S2. Mean and median of RF observed for each D2 method on the TreeBASE dataset.

Method         k   RF (mean ± s.d.)    Median RF
$D_2$          6   0.4603 ± 0.2133     0.4286
$D_2^S$        6   0.4558 ± 0.2150     0.4211
$D_2^*$        6   0.4896 ± 0.2049     0.4583
$D_2^{n=1}$    6   0.5048 ± 0.2185     0.4742
$D_2$          8   0.4420 ± 0.2088     0.4107
$D_2^S$        8   0.4382 ± 0.2077     0.4091
$D_2^*$        8   0.4804 ± 0.2010     0.4523
$D_2^{n=1}$    8   0.4500 ± 0.2147     0.4167

Supplementary Table S3. Mean and median of inferred RF grouped based on sizes of the TreeBASE sequence sets.

Size category RF (mean ± s.d.) Median RF

N ≤ 25 0.3630 ± 0.2338 0.3333

26 ≤ N ≤ 50 0.4261 ± 0.1968 0.3950

51 ≤ N ≤ 75 0.4589 ± 0.1858 0.4252

76 ≤ N ≤ 100 0.4833 ± 0.1712 0.4533

101 ≤ N ≤ 200 0.5277 ± 0.1839 0.4935

201 ≤ N ≤ 500 0.5904 ± 0.1799 0.5801

N ≥ 501 0.6609 ± 0.1220 0.6351

Supplementary Table S4. Mean and median of inferred RF grouped based on within-set sequence similarity of the TreeBASE sequence sets.

Sequence similarity (ID) RF (mean ± s.d.) Median RF

ID ≥ 90 0.4397 ± 0.2113 0.4097

80 ≤ ID < 90 0.4235 ± 0.1992 0.3917

70 ≤ ID < 80 0.4604 ± 0.2036 0.4259

ID < 70 0.5332 ± 0.1921 0.5278

Supplementary Table S5. Average computation time and memory usage of D2 across different sizes of sequence sets of the GreenGenes dataset.

Size of sequence set (N)   CPU time in seconds (mean ± s.d.)   Memory usage in MB (mean ± s.d.)

1000 49.77 ± 5.11 378.24 ± 0.46

2000 137.14 ± 10.81 689.16 ± 6.00

3000 298.08 ± 24.82 1028.52 ± 84.02

4000 542.97 ± 36.52 1744.46 ± 96.67

5000 842.98 ± 80.00 2445.31 ± 49.92

Supplementary Table S6. Mean Q observed for each D2 method across instances of simulated deletions, in comparison to MUSCLE+MrBayes and MAFFT+MrBayes. The p values (two-sided paired Student's t-test) indicate whether the observed Q values differ significantly between the use of MUSCLE and MAFFT. See Supplementary Note for more detail.

Nucleotide sequences

Deletion d   D2 method      MUSCLE+MrBayes Q (mean ± s.d.)   MAFFT+MrBayes Q (mean ± s.d.)
10nt         $D_2$          -0.0255 ± 0.0297                 -0.0255 ± 0.0297
10nt         $D_2^S$        -0.0262 ± 0.0299                 -0.0262 ± 0.0299
10nt         $D_2^*$        -0.0266 ± 0.0313                 -0.0266 ± 0.0313
10nt         $D_2^{n=1}$    -0.0155 ± 0.0252                 -0.0155 ± 0.0252
50nt         $D_2$          -0.0400 ± 0.0424                 -0.0397 ± 0.0428
50nt         $D_2^S$        -0.0407 ± 0.0431                 -0.0403 ± 0.0433
50nt         $D_2^*$        -0.0407 ± 0.0420                 -0.0403 ± 0.0419
50nt         $D_2^{n=1}$    -0.0303 ± 0.0374                 -0.0300 ± 0.0375
150nt        $D_2$          -0.1845 ± 0.0961                 -0.1828 ± 0.0964
150nt        $D_2^S$        -0.1900 ± 0.1020                 -0.1883 ± 0.1020
150nt        $D_2^*$        -0.1893 ± 0.0912                 -0.1876 ± 0.0917
150nt        $D_2^{n=1}$    -0.2352 ± 0.1132                 -0.2334 ± 0.1142

p = 0.5769 (1200 samples in each approach)

Protein sequences

Deletion d   D2 method      MUSCLE+MrBayes Q (mean ± s.d.)   MAFFT+MrBayes Q (mean ± s.d.)
3aa          $D_2$          -0.0690 ± 0.0578                 -0.0693 ± 0.0573
3aa          $D_2^S$        -0.0703 ± 0.0586                 -0.0707 ± 0.0581
3aa          $D_2^*$        -0.2683 ± 0.1352                 -0.2686 ± 0.1351
3aa          $D_2^{n=1}$    -0.0534 ± 0.0515                 -0.0538 ± 0.0510
15aa         $D_2$          -0.0979 ± 0.0673                 -0.0990 ± 0.0670
15aa         $D_2^S$        -0.0997 ± 0.0696                 -0.1007 ± 0.0691
15aa         $D_2^*$        -0.3438 ± 0.1518                 -0.3448 ± 0.1511
15aa         $D_2^{n=1}$    -0.0900 ± 0.0670                 -0.0910 ± 0.0660
50aa         $D_2$          -0.4803 ± 0.1480                 -0.4828 ± 0.1483
50aa         $D_2^S$        -0.4721 ± 0.1494                 -0.4745 ± 0.1499
50aa         $D_2^*$        -0.6497 ± 0.1286                 -0.6521 ± 0.1288
50aa         $D_2^{n=1}$    -0.6021 ± 0.1211                 -0.6045 ± 0.1224

p = 0.5331 (1200 samples in each approach)

APPENDIX B Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer

Supplementary information

Alignment-free methods used in this study

Word-count methods

$D_2^S$ statistic. The $D_2^S$ statistic 1 generates a score for each possible pair of sequences within a set based on k-mer counts. These scores are transformed via logarithmic representation of the geometric mean to generate a distance 2. Generation of the distance matrix using $D_2^S$ is implemented in a Java program, jD2Stat 2, which is freely available at bioinformatics.org.au/tools/jD2Stat/.
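The score-to-distance step described above can be illustrated with a minimal sketch. As an assumption made for brevity, it uses the plain D2 score (the sum of products of k-mer counts) rather than the standardised variant, and it is not the jD2Stat implementation; the function names `kmer_counts`, `d2_score` and `d2_distance` are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(counts_a, counts_b):
    """Plain D2: sum over shared k-mers of the product of their counts."""
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


def d2_distance(seq_a, seq_b, k):
    """Absolute log of the pair score divided by the geometric mean
    of the two self-scores (assumed form of the transformation)."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    cross = d2_score(ca, cb)
    if cross == 0:
        return math.inf  # no shared k-mers: treated as maximally distant
    norm = math.sqrt(d2_score(ca, ca) * d2_score(cb, cb))
    return abs(math.log(cross / norm))
```

With this form, two identical sequences have distance 0, and the distance grows as fewer k-mers are shared.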

Feature frequency profiles. The ffp pipeline 3-5 builds the k-mer frequency profile for each sequence, and then uses the Jensen-Shannon divergence 4 to compare their profiles and generate a distance between the sequences.
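The profile comparison can be sketched as follows. This is an illustrative reimplementation of the idea, not the ffp pipeline itself; the function names and the choice of base-2 logarithms are assumptions.

```python
import math
from collections import Counter


def frequency_profile(seq, k):
    """Relative frequency of each overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two profiles."""
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two profiles
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / mw)
    return div
```

With base-2 logarithms the divergence lies between 0 (identical profiles) and 1 (no k-mer in common).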

Composition vector. The composition vector method cvt shares the same principle as ffp, but here the k-mer frequencies are divided by the frequencies expected by chance alone 6.

Co-phylog. This method counts the proportion of shared words that differ in the middle position as a distance between two sequences, i.e. a mutation point surrounded by a context of certain length present in both sequences (with K, the half-context length at each side of the mutation point) 7.
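A simplified sketch of this context-based distance is given below. It assumes a single symmetric context of K letters on each side and discards contexts that are ambiguous within a sequence; the published tool is more elaborate, and the function names are hypothetical.

```python
from collections import defaultdict


def context_objects(seq, half):
    """Map each context (the `half` letters on either side) to its middle letter.

    Contexts seen with more than one middle letter in the same sequence are
    dropped, a simplification assumed here to keep the mapping unambiguous.
    """
    seen = defaultdict(set)
    width = 2 * half + 1
    for i in range(len(seq) - width + 1):
        word = seq[i:i + width]
        seen[word[:half] + word[half + 1:]].add(word[half])
    return {ctx: next(iter(objs)) for ctx, objs in seen.items() if len(objs) == 1}


def context_object_distance(seq_a, seq_b, half):
    """Proportion of shared contexts whose middle letters differ."""
    a, b = context_objects(seq_a, half), context_objects(seq_b, half)
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no shared context: distance undefined in this sketch
    return sum(a[c] != b[c] for c in shared) / len(shared)
```

For example, with K = 2 the sequences AACCGTTGCA and AACCATTGCA share two contexts, one of which surrounds the substituted position, giving a distance of 0.5.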

Spaced-word frequencies. Spaced uses sets of match and mismatch position patterns to compare the spaced-word frequencies between two sequences. The patterns can vary in length and number n 8,9.
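The extraction of spaced words under a single pattern can be sketched as below. The pattern encoding ('1' for a position that must match, '0' for a position that is ignored) and the function name are assumptions made for illustration; the sketch covers only word counting, not the distance computed by the published program.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Count spaced words: letters at '1' positions of the pattern are kept,
    letters at '0' positions are ignored."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return Counter(
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    )
```

Because ignored positions do not contribute, a substitution that falls on a '0' position leaves the spaced word unchanged, e.g. ACG and ATG give the same word under the pattern 101.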

Match-lengths methods

Grammar-based method. The method gram uses a set of rules to decompose a string into substrings, e.g. CCCTTTAA decomposed as C3T3A2. The idea is that closely related sequences are more compressible than divergent sequences 10. Different compression schemes can be used, but Lempel-Ziv factorisation (used for gzip compression) is the most common 11.

Average Common Substring. The Average Common Substring (acs) method 12 uses the concept of matching statistics 13. Instead of decomposing the concatenation of two sequences, this method searches for the longest match in sequence A starting at every position in sequence B. Unlike the Lempel-Ziv factorisation method, here the longest matches can overlap.
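The matching-statistics idea can be sketched with a naive search. This is for illustration only: practical implementations rely on suffix-based indexes, and the conversion of the average length into a distance is omitted here. The function names are hypothetical.

```python
def longest_match_lengths(query, target):
    """For every start position in query, the length of the longest substring
    beginning there that also occurs somewhere in target (naive search)."""
    lengths = []
    for i in range(len(query)):
        length = 0
        # Extend the match one letter at a time while it still occurs in target.
        while i + length < len(query) and query[i:i + length + 1] in target:
            length += 1
        lengths.append(length)
    return lengths


def average_common_substring(query, target):
    """Mean of the longest-match lengths over all positions of query."""
    lengths = longest_match_lengths(query, target)
    return sum(lengths) / len(lengths)
```

Matches starting at neighbouring positions may overlap, which is the difference from a factorisation noted above.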

Shortest unique substring. Instead of looking for the longest matches between two sequences, kr looks for the longest common substrings extended by one, known as the longest substrings between two mutations, i.e. the SHortest Unique subSTRING, shustring 14.

k-Mismatch Average Common Substring (kmacs). This method is a variant of the acs method, using the longest common substring with k-mismatches (the number of mismatches is noted mm in our study) 9,15.

Optimisation of parameter settings

Six of the nine AF methods used in this study require the specification of a key parameter that influences the estimated distances among a set of sequences, and thus the resulting trees. $D_2^S$, ffp and cvt require a value to be set for k (i.e. k-mer length), co-phylog requires a half-context length K, spaced a number of patterns n, and kmacs a number of mismatches mm. To assess the optimal parameter setting for each of these methods in inferring phylogenies across the simulated data, we ran each program across a range of values and inferred a phylogenetic tree at each value. We used k = [14-26] for $D_2^S$, ffp and cvt, K = 6, 7, 8 and 9 for co-phylog, and, based on the authors' recommendations 15, n = 60, 70, 80, 90 and 100 for spaced, and mm = [12-20] for kmacs.

We consider, in each scenario, the parameter that yielded the minimum average RF (i.e. the tree topology that is the most congruent to reference) as optimal. The other three methods, gram, kr and acs were run using the default parameters.
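The selection rule stated above amounts to taking the parameter value with the lowest mean RF over replicates. A minimal sketch follows, assuming the RF values are already available as a mapping from parameter value to a list of per-replicate RF distances; the function name and data layout are hypothetical.

```python
from statistics import mean


def optimal_parameter(rf_by_parameter):
    """Return the parameter value whose replicate trees have the lowest mean RF.

    rf_by_parameter maps a parameter value (e.g. a k-mer length) to the list of
    RF distances observed across replicates at that value.
    """
    return min(rf_by_parameter, key=lambda p: mean(rf_by_parameter[p]))
```

Ties are resolved in favour of the first parameter value encountered, which is a property of `min` rather than a choice described in the text.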

Analysis of genome divergence. Supplementary Fig. S1 shows the mean RF observed using the six AF methods across different parameter settings and the mutation rate m. For $D_2^S$ (Supplementary Fig. S1a), ffp (Supplementary Fig. S1b) and cvt (Supplementary Fig. S1c), the optimal k-mer length increases slightly with m (e.g. for cvt, the optimal k = 17, 20 and 21 at m = 0.1, 0.5 and 0.9). Minimum RF values observed using optimal k in these methods do not differ by much and are relatively small, i.e. < 0.05 in most cases. For co-phylog (Supplementary Fig. S1d) the best K is either 6 or 7 depending on m. For spaced (Supplementary Fig. S1e), varying n yielded very similar RF values (e.g. RF = 0.5 across all n examined at m = 0.9). Finally, for kmacs (Supplementary Fig. S1f), the optimal mm increases as m increases, e.g. optimal mm = 12, 19 and 20 at m = 0.1, 0.5 and 0.9; the difference in RF values across mm examined is relatively small, i.e. < 0.025 across incremental mm in each case.

[Figure: RF (y-axis) against m = 0.1, 0.5 and 0.9 (x-axis), one series per parameter value, in six panels: (a) $D_2^S$, (b) ffp, (c) cvt, (d) co-phylog, (e) spaced, (f) kmacs.]

Supplementary Figure S1: Accuracy of AF methods, at different parameters, based on m. RF distances are shown for the AF methods with different parameters across different m. Each value is an average among 50 replicates and the error bars indicate standard deviation from the mean.

[Figure: RF (y-axis) against l = 0, 5, 25, 125, 250 and 500 (x-axis), one series per parameter value, in six panels: (a) $D_2^S$, (b) ffp, (c) cvt, (d) co-phylog, (e) spaced, (f) kmacs.]

Supplementary Figure S2: Accuracy of AF methods, at different parameters, based on l. RF distances are shown for the AF methods with different parameters across different l. Each value is an average among 50 replicates and the error bars indicate standard deviation from the mean.

[Figure: RF (y-axis) against d = 200, 1000, 3000 and 5000 (x-axis), one series per parameter value, in six panels: (a) $D_2^S$, (b) ffp, (c) cvt, (d) co-phylog, (e) spaced, (f) kmacs.]

Supplementary Figure S3: Accuracy of AF methods, at different parameters, based on d. RF distances are shown for the AF methods with different parameters across different d. Each value is an average among 50 replicates and the error bars indicate standard deviation from the mean.

Analysis of lateral genetic transfer. Supplementary Fig. S2 shows the mean RF obtained using the six AF methods across different parameter settings and the LGT per iteration rate l. For $D_2^S$ (Supplementary Fig. S2a) and cvt (Supplementary Fig. S2c) the optimal k-mer lengths decrease as l increases (e.g. optimal k = 17, 15 and 14 at l = 0, 125 and 500 for $D_2^S$). Minimum RF values observed using optimal k in these methods do not differ by much and are relatively small, i.e. < 0.05 across incremental k. For ffp, co-phylog and kmacs (Supplementary Figs S2b, S2d and S2f) the optimal parameter is always the same at different l values, i.e. optimal setting k = 20 for ffp, K = 6 for co-phylog and mm = 20 for kmacs. Finally, for spaced (Supplementary Fig. S2e), the use of different n settings yielded similar RF values, e.g. all tested values of n (60, 70, 80, 90 and 100) are equally optimal at l = 0, 25 and 250.

Supplementary Fig. S3 shows the mean RF obtained using the six AF methods across different parameter settings and the divergence factor d. For all the methods, RF varies very little between the best parameters, e.g. for cvt, RF ranges between 0.032-0.037 at k = 20 and between 0.028-0.033 at k = 22 (Supplementary Fig. S3c). As observed in Supplementary Figs S1e and S2e, varying the setting of n in spaced yielded almost identical RF.

Analysis of genome rearrangements. Supplementary Fig. S4 shows the mean RF obtained using five of the six AF methods across different parameter settings and the rearrangement rate r. Owing to the highly consistent results observed in spaced using different values of n (as described above), we did not assess the setting of n in spaced in this case. In all methods, RF varies very little between the best parameters, e.g. for cvt, RF ranges between 0.068 (r = 1.00) and 0.070 (r = 0.10) at k = 14 (Supplementary Fig. S4c).

To visualise the extent of inverted translocation across these datasets, we computed the MSA of these datasets using progressiveMauve 16. Supplementary Fig. S5 shows an example of part of the whole genome alignments at each r. The number of locally collinear blocks 17 in Mauve alignments increases proportionately with r, illustrating the complexity in multiply aligning these genomes, particularly when r ≥ 0.1.

[Figure: RF (y-axis) against r = 0.00, 0.01, 0.10 and 1.00 (x-axis), one series per parameter value, in five panels: (a) $D_2^S$, (b) ffp, (c) cvt, (d) co-phylog, (f) kmacs.]

Supplementary Figure S4: Accuracy of AF methods, at different parameters, based on r. RF distances are shown for the AF methods with different parameters across different r. Each value is an average among 50 replicates and the error bars indicate standard deviation from the mean.

[Figure: progressiveMauve alignment views in four panels: (a) r = 0.00, (b) r = 0.01, (c) r = 0.10, (d) r = 1.00.]

Supplementary Figure S5: Visualisation of inverted translocation rates (r) using progressiveMauve.

Analysis of genome divergence in empirical data

To estimate the divergence among the 143 genomes, we first generated a phylogenetic tree using the $D_2^S$ method 18 at k = 26 (Supplementary Fig. S6). Because $D_2^S$ is a simple dissimilarity measure, the branch lengths of this tree indicate divergence (although not directly interpretable) of the species. The tree inferred by $D_2^S$ (Supplementary Fig. S6) shows that Archaea are separated (with long branch lengths within clade) from the Bacteria. The internal branch lengths are relatively shorter than the external branches, suggesting that multiple speciation events (i.e. an adaptive radiation process) occurred in a short timeframe after the establishment of corresponding niches (e.g. bacterial groupings).

We also compute the percentage of shared k-mers $P_k$ for each genome pair across all empirical datasets (Supplementary Table S1), here with k = 12. We calculate this percentage $P_k$ as

$$P_k = \frac{S \times 100}{T},$$

where S is the sum of occurrences of shared k-mers in both genomes, and T is the sum of all possible k-mers in both genomes (i.e. g − k + 1, with g = genome length); a small value of $P_k$ indicates high divergence among the genomes. If k is too small all the genomes tend to share almost all k-mers, whereas if k is too large too few k-mers are shared, particularly on the 143-genome set. At $P_{k=12}$ we observe an appropriate range of similarities across the different datasets. The 143-genome set is the most divergent ($P_{k=12}$ = 15.94%), followed by Yersinia ($P_{k=12}$ = 61.39%) and E. coli-Shigella ($P_{k=12}$ = 64.06%).
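The percentage of shared k-mers can be sketched for one genome pair as follows. The sketch reads S as the number of k-mer occurrences, summed over both genomes, that belong to k-mers present in both, and T as the total number of k-mer positions in both genomes; this reading, the function name, and the fact that reverse complements are ignored are assumptions.

```python
from collections import Counter


def shared_kmer_percentage(genome_a, genome_b, k):
    """P_k = S * 100 / T for one genome pair (both genomes at least k long)."""
    ca = Counter(genome_a[i:i + k] for i in range(len(genome_a) - k + 1))
    cb = Counter(genome_b[i:i + k] for i in range(len(genome_b) - k + 1))
    # S: occurrences, in both genomes, of the k-mers present in both.
    s = sum(ca[w] + cb[w] for w in ca.keys() & cb.keys())
    # T: all k-mer positions in both genomes, (g - k + 1) for each genome.
    t = (len(genome_a) - k + 1) + (len(genome_b) - k + 1)
    return 100.0 * s / t
```

For instance, ACGTAC and ACGTTT share two of their four 3-mers each, so the function returns 50.0 at k = 3.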

Supplementary Table S1. The mean $P_{k=12}$ value (and standard deviation) for each empirical dataset, with the minimum and maximum genome pairs.

Dataset                    Average genome length (Mb) ± SD   Mean ± SD       Minimum $P_k$ (taxa)                                    Maximum $P_k$ (taxa)
Yersinia genomes           4.634 ± 0.080                     61.39 ± 8.00    40.52 (Y. pseudotuberculosis IP32953 and IP31758)       72.43 (Y. pseudotuberculosis IP32953 and Y. pestis pestoidF)
E. coli/Shigella genomes   4.906 ± 0.294                     64.06 ± 10.58   37.15 (E. coli HS and ATCC_9739)                        94.51 (E. coli S88 and APEC_01)
143 genomes                3.011 ± 1.802                     15.94 ± 10.44   0.42 (Wigglesworthia and Streptomyces coelicolor)       99.45 (Chlamydophila pneumoniae TW 183 and CWL029)

Selection of jackknife rate for pseudo-replicates generation

Previous studies suggested that a jackknife “rate” (proportion) ρ = 37.5-40% is optimal for the jackknife technique 19,20, but this was based on aligned sequences or gene-order data. To determine which ρ was optimal for our jackknife (JK) analysis, we used a range of 20-80% cut-off to generate 100 pseudo-replicates for each empirical dataset. We followed the same technique described in the main Methods to generate JK support values at each rate; the results are shown in Supplementary Fig. S8. The JK values decrease as ρ increases across the three empirical datasets, and across two different sizes of k for the Yersinia genomes. With ρ below 40% we observed for all trees a mean JK value above 90%, with an almost perfect value of 100% for the E. coli/Shigella and Yersinia (at k = 9) datasets. At ρ higher than 50% we observed large variation in the JK values across all empirical datasets. At ρ = 40-50% we observed different levels of variation in the JK values depending on the dataset: the JK values are above 90% for the E. coli/Shigella and Yersinia (at k = 9) datasets, and above 80% for the 143-genome and Yersinia (at k = 7) datasets (with a larger distribution for the 143-genome set). Our finding suggests that ρ between 30-60% would give the best dynamic range of support values across different datasets; for that reason we decided to use 40% for our analysis.

Assessment of jackknife pseudo-replicates in different AF methods

To assess the robustness of jackknife support in each of the AF methods, we independently assessed topological difference, measured as RF (normalised Robinson-Foulds distance), between (a) a supertree that is summarised from 100 trees (independently generated from each JK pseudo-replicate of the same data), and (b) the tree generated using the AF method from the original (non-jackknifed) data (144 prokaryote genomes). Here we used three supertree methods: matrix representation with parsimony (MRP) 21, the fast subtree prune-and-regraft (fast-SPR) 22 and the extended majority rule (exMR) 23; the results are shown in Supplementary Table S2. We used the R package phytools 24 to generate the MRP supertree, SPRSupertrees (kiwi.cs.dal.ca/Software/SPRSupertrees) to generate the fast-SPR supertree, and the consense program as implemented in PHYLIP v3.69 to generate the exMR supertree.
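The topological comparison can be illustrated with a small sketch of a normalised Robinson-Foulds distance. It assumes trees are supplied as nested tuples of leaf names, treats them as unrooted, and normalises by 2(N − 3), the maximum for two fully resolved trees on N leaves; the representation, the normalisation and the function names are assumptions, and the sketch is not the software used in this study.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, all_leaves):
    """Non-trivial splits of the tree, each stored as the side of the split
    that excludes a fixed reference leaf."""
    reference = min(all_leaves)
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(below) < len(all_leaves) - 1:
            splits.add(below if reference not in below else all_leaves - below)
        return below

    leaves_below(tree)
    return splits


def normalised_rf(tree_a, tree_b):
    """Symmetric difference of the split sets divided by 2(N - 3); needs N >= 4."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    different = bipartitions(tree_a, leaves) ^ bipartitions(tree_b, leaves)
    return len(different) / (2 * (len(leaves) - 3))
```

A value of 0 means the two topologies are identical and a value of 1 means they share no non-trivial split.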

We observed the lowest RF for $D_2^S$ (RF = 0.04) and the highest for kmacs (RF = 0.45); there was little variation among the supertrees as generated using each of the three methods. Our results suggest that some AF methods (particularly $D_2^S$, ffp and cvt) are more robust to data truncation and therefore more appropriate for the calculation of jackknife support.

Supplementary Table S2. RF between the tree generated using each AF method and each of the three supertrees generated using the corresponding AF method, summarised from 100 JK pseudo-replicates.

AF method           MRP    fast-SPR   ExMR
$D_2^S$ (k=16)      0.04   0.04       0.04
ffp (k=16)          0.09   0.11       0.09
cvt (k=18)          0.07   0.16       0.14
gram                0.19   0.15       0.15
spaced (n=60)       0.24   0.24       0.24
co-phylog (K=8)     0.29   0.29       0.33
kr                  0.37   0.44       0.41
kmacs (mm=12)       0.44   0.45       0.45

[Figure: phylogenetic tree; tip labels are the names of the 143 prokaryote genomes.]

Supplementary Figure S6: Phylogenetic tree of 143 prokaryote genomes generated using $D_2^S$ at k = 26.

Supplementary Figure S7: Whole-genome alignment of eight genomes of Yersinia using progressiveMauve.

[Figure: boxplots of JK (y-axis) against ρ = 20 to 80 (x-axis) in four panels: (a) 143 prokaryote genomes, (b) 27 E. coli/Shigella genomes, (c) 8 Yersinia genomes (k = 7), (d) 8 Yersinia genomes (k = 9).]

Supplementary Figure S8: Boxplots of jackknife values (JK) across jackknife rates (ρ) for four AF-trees generated using $D_2^S$. Each box depicts the interquartile range between the first and third quartile, with median identified by a line in the box.

Supplementary References

1 Wan, L., Reinert, G., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17, 1467-1490, doi:10.1089/cmb.2010.0056 (2010).

2 Chan, C. X., Bernard, G., Poirion, O., Hogan, J. M. & Ragan, M. A. Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4, 6504, doi:10.1038/srep06504 (2014).

3 Jun, S. R., Sims, G. E., Wu, G. A. & Kim, S. H. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proc Natl Acad Sci U S A 107, 133-138, doi:10.1073/pnas.0913033107 (2010).

4 Sims, G. E., Jun, S. R., Wu, G. A. & Kim, S. H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A 106, 2677-2682, doi:10.1073/pnas.0813249106 (2009).

5 Sims, G. E. & Kim, S. H. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci U S A 108, 8329-8334, doi:10.1073/pnas.1105168108 (2011).

6 Wang, H., Xu, Z., Gao, L. & Hao, B. A fungal phylogeny based on 82 complete genomes using the composition vector method. BMC Evol Biol 9, 195, doi:10.1186/1471-2148-9-195 (2009).

7 Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res 41, e75, doi:10.1093/nar/gkt003 (2013).

8 Leimeister, C. A., Boden, M., Horwege, S., Lindner, S. & Morgenstern, B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30, 1991-1999, doi:10.1093/bioinformatics/btu177 (2014).

9 Leimeister, C. A. & Morgenstern, B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30, 2000-2008, doi:10.1093/bioinformatics/btu331 (2014).

10 Russell, D. J., Way, S. F., Benson, A. K. & Sayood, K. A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences. BMC Bioinformatics 11, 601, doi:10.1186/1471-2105-11-601 (2010).

11 Haubold, B. Alignment-free phylogenetics and population genetics. Brief Bioinform 15, 407-418, doi:10.1093/bib/bbt083 (2014).

12 Ulitsky, I., Burstein, D., Tuller, T. & Chor, B. The average common substring approach to phylogenomic reconstruction. J Comput Biol 13, 336-350, doi:10.1089/cmb.2006.13.336 (2006).

13 Gusfield, D. Algorithms on strings, trees and sequences: computer science and computational biology. (Cambridge University Press, 1997).

14 Haubold, B., Pierstorff, N., Moller, F. & Wiehe, T. Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics 6, 123, doi:10.1186/1471-2105-6-123 (2005).

15 Horwege, S. et al. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42, W7-11, doi:10.1093/nar/gku398 (2014).

16 Darling, A. E., Mau, B. & Perna, N. T. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5, e11147, doi:10.1371/journal.pone.0011147 (2010).

17 Darling, A. C., Mau, B., Blattner, F. R. & Perna, N. T. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14, 1394-1403, doi:10.1101/gr.2289704 (2004).

18 Reinert, G., Chew, D., Sun, F. & Waterman, M. S. Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16, 1615-1634, doi:10.1089/cmb.2009.0198 (2009).

19 Lapointe, F. J., Kirsch, J. A. & Bleiweiss, R. Jackknifing of weighted trees: validation of phylogenies reconstructed from distance matrices. Mol Phylogenet Evol 3, 256-267, doi:10.1006/mpev.1994.1028 (1994).

20 Shi, J., Zhang, Y., Luo, H. & Tang, J. Using jackknife to assess the quality of gene order phylogenies. BMC Bioinformatics 11, 168, doi:10.1186/1471-2105-11-168 (2010).

21 Ragan, M. A. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol 1, 53-58, doi:10.1016/1055-7903(92)90035-F (1992).

22 Whidden, C., Zeh, N. & Beiko, R. G. Supertrees Based on the Subtree Prune-and-Regraft Distance. Syst Biol 63, 566-581, doi:10.1093/sysbio/syu023 (2014).

23 Margush, T. & McMorris, F. R. Consensus-trees. Bulletin of Mathematical Biology 43, 239-244 (1981).

24 Revell, L. J. phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution 3, 217-223, doi:10.1111/j.2041-210X.2011.00169.x (2012).

APPENDIX C K-mer similarity, networks of microbial genomes and taxonomic rank

Supplementary Figures

Supplementary Figure S1: P-network of prokaryote phyla using $D_2^S$ with k=25, based on rRNAs. Edges represent connections between isolates of two phyla. The node size is proportional to the number of isolates in a phylum. Distance threshold = 6.


Supplementary Figure S2: PCA analysis performed on the raw data of the COG categories profile for each genus. Each phylum is color-coded.


Supplementary Figure S3: PCA analysis performed on the raw data of the COG categories profile for each genus. Each genus is color-coded according to the number of isolates.


Supplementary Figure S4: PCA analysis performed on the normalised counts of center-scaled COG categories. Each phylum is color-coded.

[Figure S5: tables and columns recoverable from the diagram]

Mer25: BasesPacked BINARY(8), MerMetadata BINARY(8)

RNAMers: MerSize SMALLINT, SpeciesNo SMALLINT, BasesPacked BINARY(8)

SpeciesNames: SpeciesUid INT, SpeciesID VARCHAR(100), GenusID VARCHAR(50), SubfamilyID VARCHAR(50), FamilyID VARCHAR(50), SuborderID VARCHAR(50), OrderID VARCHAR(50), SubclassID VARCHAR(50), ClassID VARCHAR(50), PhylumID VARCHAR(50), DomainID VARCHAR(50)

Species: SpeciesNo INT, SpeciesID VARCHAR(100), SpeciesUid INT, Finished INT

Sequences: SpeciesNo INT, SequenceNo INT, SequenceID VARCHAR(20), SequenceDesc VARCHAR(100), SequenceLength INT

Genes: GeneKey INT, SpeciesNo INT, SequenceNo INT, GeneNo INT, GenePID VARCHAR(20), GeneName VARCHAR(20), GeneSynonym VARCHAR(20), GeneCode VARCHAR(20), GeneCOG VARCHAR(20), GeneProduct VARCHAR(200), GeneRNA CHAR(3), GeneGC INT, GeneStrand CHAR(1), GeneStart INT, GeneEnd INT, GeneLength INT

GeneSeqs: GeneKey INT, Gene LONGTEXT

GenomeToGenomeMatches: MerSize SMALLINT, SourceSpeciesNo INT, TargetSpeciesNo INT, MerMatches INT, MersMatchesNonrRNA INT, MerMatchesPC REAL, MersMatchesNonrRNAPC REAL

SequenceToSequenceMatches: MerSize SMALLINT, SourceSpeciesNo INT, SourceSequenceNo INT, TargetSpeciesNo INT, TargetSequenceNo INT, MerMatches INT, MersMatchesNonrRNA INT, MerMatchesPC REAL, MersMatchesNonrRNAPC REAL

GeneToGeneMatches: MerSize SMALLINT, SourceSpeciesNo INT, SourceSequenceNo INT, SourceGeneNo INT, TargetSpeciesNo INT, TargetSequenceNo INT, TargetGeneNo INT, MerMatches INT

Index names shown in the diagram: PRIMARY, IX_RNAMers, BySpeciesID, IX_SpeciesNameByID, BySequenceID, ByGenePID, ByGeneLocation, GenomeToGeneMatchesByTarget, GenomeToGenomeMatchesByTarget, GeneToGeneMatchesByTarget

Supplementary Figure S5: Relational diagram of the SQL k-mer database.
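As an illustration of how one table of this diagram might be queried, the following minimal sketch recreates GenomeToGenomeMatches with the column names and types shown in the figure. Several points are assumptions rather than facts from the thesis: an in-memory SQLite database stands in for the original database engine, the composite primary key is a guess (the diagram only labels an index PRIMARY), MerMatchesPC is read as a percentage, the inserted rows are invented, and the helper `targets_above` is hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the original engine.
conn = sqlite3.connect(":memory:")

# Column names and types follow the diagram; the composite primary key is an assumption.
conn.execute(
    """
    CREATE TABLE GenomeToGenomeMatches (
        MerSize              SMALLINT,
        SourceSpeciesNo      INT,
        TargetSpeciesNo      INT,
        MerMatches           INT,
        MersMatchesNonrRNA   INT,
        MerMatchesPC         REAL,
        MersMatchesNonrRNAPC REAL,
        PRIMARY KEY (MerSize, SourceSpeciesNo, TargetSpeciesNo)
    )
    """
)

# Invented rows, for illustration only.
rows = [
    (25, 1, 2, 120000, 118500, 37.2, 36.9),
    (25, 1, 3, 900, 100, 0.3, 0.03),
    (25, 2, 3, 45000, 44000, 13.6, 13.4),
]
conn.executemany(
    "INSERT INTO GenomeToGenomeMatches VALUES (?, ?, ?, ?, ?, ?, ?)", rows
)


def targets_above(conn, source_species_no, min_pc, mer_size=25):
    """Return (TargetSpeciesNo, MerMatchesPC) pairs for one source genome,
    keeping only targets whose MerMatchesPC is at least min_pc."""
    cur = conn.execute(
        "SELECT TargetSpeciesNo, MerMatchesPC FROM GenomeToGenomeMatches "
        "WHERE MerSize = ? AND SourceSpeciesNo = ? AND MerMatchesPC >= ? "
        "ORDER BY MerMatchesPC DESC",
        (mer_size, source_species_no, min_pc),
    )
    return cur.fetchall()


print(targets_above(conn, 1, 1.0))
```

Storing one row per (k, source, target) triple means that a network at a given threshold can be extracted with a single filtered query, which is one plausible motivation for such a layout.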

Supplementary Tables

Supplementary Table S1: List of the 2785 isolates used in this analysis.

Supplementary Table S1 can be downloaded from http://biorxiv.org/content/early/2017/04/07/125237.figures-only

Supplementary Table S2: Network analysis of the I-network for 2705 complete genomes of bacteria and archaea.

All included

Threshold  Nodes connected  Edges    Edges per node  Maximum clique size  Maximal clique enumeration
0          2705             3835070  1417.8          2700                 10
1          2705             3797932  1404.0          2445                 N/A
2          2705             1991769  736.3           860                  N/A
3          2680             302327   112.8           339                  1662785
4          2378             72076    30.3            211                  6181
5          2091             32344    15.5            124                  3344
6          1860             19374    10.4            82                   525
7          1676             14831    8.8             64                   229
8          1538             13287    8.6             61                   224
9          1358             9898     7.3             48                   232
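The columns of these network tables can be illustrated with a small sketch. It assumes that an edge is retained when a pairwise similarity score is at least the threshold (consistent with the edge counts falling as the threshold rises, but not stated here), and that "Edges per node" is the number of edges divided by the number of connected nodes, which agrees with the tabulated values (for example 72076 / 2378 ≈ 30.3). The functions `maximal_cliques` and `network_stats` and the toy scores are hypothetical; clique enumeration uses the Bron-Kerbosch algorithm with pivoting, which is not necessarily the method used in the thesis.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (pivoting variant).

    adj maps each node to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def network_stats(scores, threshold):
    """Summarise the network obtained by keeping pairs with score >= threshold.

    scores maps (node_a, node_b) tuples to a similarity score (assumed convention).
    """
    adj = {}
    edges = 0
    for (a, b), score in scores.items():
        if score >= threshold:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
            edges += 1
    if not adj:
        return {"nodes": 0, "edges": 0, "edges_per_node": 0.0,
                "max_clique": 0, "n_maximal_cliques": 0}
    cliques = maximal_cliques(adj)
    return {
        "nodes": len(adj),  # only nodes with at least one edge are counted
        "edges": edges,
        "edges_per_node": round(edges / len(adj), 1),
        "max_clique": max(len(c) for c in cliques),
        "n_maximal_cliques": len(cliques),
    }


# Toy similarity scores between five hypothetical genomes.
toy_scores = {("A", "B"): 9, ("A", "C"): 8, ("B", "C"): 7,
              ("C", "D"): 3, ("D", "E"): 1}
for t in (0, 5, 8):
    print(t, network_stats(toy_scores, t))
```

The number of maximal cliques can grow exponentially with network size, so a pure-Python enumeration of this kind is only practical for small or sparse networks.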

Supplementary Table S3: Network analysis of the I-network for 2616 genomes of bacteria and archaea, with rRNA genes removed.

No RNA

Threshold  Nodes connected  Edges    Edges per node  Maximum clique size  Maximal clique enumeration
0          2615             1720082  657.8           1226                 N/A
1          2597             766363   295.1           548                  N/A
2          2555             253871   99.4            367                  164221
3          2394             93069    38.9            220                  5379
4          2182             43561    20.0            159                  5139
5          1959             24514    12.5            117                  631
6          1761             16960    9.6             74                   299
7          1591             13638    8.6             62                   120
8          1460             12199    8.4             59                   117
9          1290             9008     7.0             47                   131

Supplementary Table S4: Network analysis of the rRNA gene sequences I-network of 2616 bacterial and archaeal isolates.

RNA only

Threshold  Nodes connected  Edges    Edges per node  Maximum clique size  Maximal clique enumeration
0          2616             3245891  1240.8          2356                 N/A
1          2616             3232885  1235.8          2356                 N/A
2          2616             3195097  1221.4          2344                 N/A
3          2616             3126571  1195.2          2332                 N/A
4          2616             3086731  1179.9          2291                 N/A
5          2616             2957555  1130.6          2045                 N/A
6          2613             2232614  854.4           1321                 N/A
7          2597             548474   211.2           530                  N/A
8          2509             191491   76.3            299                  45272
9          2162             45916    21.2            185                  289

Supplementary Table S5: Network analysis of the plasmid genomes I-network of 921 bacterial and plasmid genomes.

Plasmid

Threshold  Nodes connected  Edges  Edges per node  Maximum clique size  Maximal clique enumeration
0          745              10679  14.3            48                   20557
1          718              9025   12.6            46                   13272
2          680              7507   11.0            45                   3925
3          648              6391   9.9             39                   1406
4          601              5240   8.7             34                   800
5          556              4266   7.7             30                   589
6          499              3167   6.3             25                   368
7          439              2000   4.6             13                   122
8          353              1269   3.6             11                   26
9          245              991    4.0             9                    14

Supplementary Table S6: Statistics of core k-mers for 151 genera.

Genera | Number of distinct core k-mers | Number of isolates | Number of core k-mers / isolates

Acetobacter 2042974 9 226997.11

Acholeplasma 1526 3 508.67

Achromobacter 87624 3 29208.00

Acidithiobacillus 3158 4 789.50

Acidovorax 9544 5 1908.80

Acinetobacter 5990 19 315.26

Actinobacillus 4920 5 984.00

Actinoplanes 20813 4 5203.25

Aeromonas 24147 4 6036.75

Aggregatibacter 18551 4 4637.75

Agrobacterium 5343 4 1335.75

Alteromonas 5742 14 410.14

Amycolatopsis 113989 5 22797.80

Anabaena 4613 3 1537.67

Anaeromyxobacter 31770 4 7942.50

Anaplasma 45 9 5.00

Archaeoglobus 32 4 8.00

Arcobacter 6665 5 1333.00

Arthrobacter 2775 6 462.50

Azospirillum 28797 3 9599.00

Azotobacter 5166237 3 1722079.00

Bartonella 2154 9 239.33

Bdellovibrio 2278 3 759.33

Blochmannia 1625 4 406.25

Bordetella 17 10 1.70

Brachyspira 242 7 34.57

Bradyrhizobium 825 5 165.00

Brucella 6507 20 325.35

Buchnera 993 13 76.38

Caldicellulosiruptor 2366 8 295.75

Carnobacterium 7435 3 2478.33

Carsonella 572 7 81.71

Caulobacter 16487 4 4121.75

Chlamydia 391 86 4.55

Chloroflexus 9267 3 3089.00

Clavibacter 245101 3 81700.33

Corynebacterium 605 51 11.86

Coxiella 104030 5 20806.00

Cronobacter 10482 5 2096.40

Cupriavidus 65493 3 21831.00

Cyanothece 833 6 138.83

Dehalococcoides 31804 8 3975.50

Deinococcus 792 7 113.14

Desulfitobacterium 15811 4 3952.75

Desulfosporosinus 7832 3 2610.67

Desulfotomaculum 1022 6 170.33

Desulfovibrio 265 14 18.93

Desulfurococcus 1497 3 499.00

Dickeya 7131 4 1782.75

Edwardsiella 643366 4 160841.50

Ehrlichia 187 6 31.17

Enterobacter 2957 11 268.82

Enterococcus 2532 13 194.77

Erwinia 15869 7 2267.00

Escherichia 5893 64 92.08

Exiguobacterium 5335 4 1333.75

Flavobacterium 1303 5 260.60

Francisella 383 19 20.16

Frankia 178 5 35.60

Gardnerella 368 3 122.67

Geobacillus 5271 11 479.18

Geobacter 1138 9 126.44

Glaciecola 2150 3 716.67

Gluconacetobacter 8177 3 2725.67

Gordonia 17512 3 5837.33

Haloarcula 564756 3 188252.00

Hydrogenobaculum 482260 3 160753.33

Hyphomicrobium 1779 4 444.75

Klebsiella 7918 12 659.83

Lactobacillus 515 57 9.04

Lactococcus 4243 13 326.38

Legionella 4291 13 330.08

Leptospira 60 7 8.57

Leuconostoc 2725 8 340.63

Liberibacter 3637 4 909.25

Listeria 9012 34 265.06

Mannheimia 4585 8 573.13

Marinobacter 4554 4 1138.50

Marinomonas 5524 3 1841.33

Meiothermus 21977 3 7325.67

Mesorhizobium 7575 4 1893.75

Methanobacterium 1194 3 398.00

Methanobrevibacter 1067 3 355.67

Methanocaldococcus 3459 5 691.80

Methanocella 1105 3 368.33

Methanosaeta 711 3 237.00

Methanosarcina 7548 4 1887.00

Methylobacterium 2433 8 304.13

Myxococcus 72867 3 24289.00

Neisseria 4303 18 239.06

Nitrosococcus 3395 3 1131.67

Nitrosomonas 1435 4 358.75

Nitrosopumilus 7442 3 2480.67

Nocardia 36570 3 12190.00

Nostoc 3218 4 804.50

Oligotropha 6700 3 2233.33

Paenibacillus 2784 11 253.09

Pantoea 5889 6 981.50

Pasteurella 201523 4 50380.75

Pectobacterium 76126 5 15225.20

Pediococcus 8560 3 2853.33

Phaeobacter 79509 3 26503.00

Porphyromonas 1516 4 379.00

Portiera 2291 5 458.20

Prevotella 674 6 112.33

Prochlorococcus 1564 12 130.33

Propionibacterium 718 14 51.29

Pseudoalteromonas 2877 3 959.00

Psychrobacter 7024 4 1756.00

Pyrobaculum 75 7 10.71

Pyrococcus 201 7 28.71

Rahnella 1053800 3 351266.67

Ralstonia 53 11 4.82

Rhizobium 5011 9 556.78

Rhodobacter 3139 5 627.80

Rhodococcus 6153 6 1025.50

Rhodopseudomonas 2661 7 380.14

Rhodospirillum 1410 4 352.50

Riemerella 9742 5 1948.40

Roseburia 14944 3 4981.33

Ruminococcus 23 4 5.75

Salmonella 2254 46 49.00

Serratia 1275 11 115.91

Shewanella 2846 24 118.58

Shigella 336980 10 33698.00

Sinorhizobium 20194 10 2019.40

Sphingobium 2129 3 709.67

Spiroplasma 129 5 25.80

Staphylococcus 2002 60 33.37

Stenotrophomonas 384359 4 96089.75

Streptococcus 1 123 0.01

Streptomyces 4 19 0.21

Sulcia 3978 5 795.60

Sulfolobus 20 17 1.18

Synechocystis 3526700 6 587783.33

Taylorella 7201 5 1440.20

Thermoanaerobacter 2274 8 284.25

Thermoanaerobacterium 39395 3 13131.67

Thermococcus 8 9 0.89

Thermus 236 7 33.71

Thioalkalivibrio 558 3 186.00

Tremblaya 356 3 118.67

Treponema 208 17 12.24

Ureaplasma 141564 3 47188.00

Variovorax 232615 3 77538.33

Vibrio 3081 24 128.38

Wolbachia 45 7 6.43

Xanthomonas 66 15 4.40

Xylella 627213 5 125442.60

Yersinia 7235 19 380.79

Zymomonas 26 6 4.33
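The quantities in Supplementary Table S6 can be sketched as follows, under the assumption that a core k-mer of a genus is a k-mer present in every isolate of that genus (the definition is given elsewhere in the thesis and may differ in detail). The last column is the number of distinct core k-mers divided by the number of isolates, as the tabulated values indicate (for example 2042974 / 9 ≈ 226997.11 for Acetobacter). The function names and toy sequences are hypothetical, k = 4 is used only to keep the example small, and reverse complements are not considered.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def core_kmers(genomes, k):
    """k-mers found in every genome of the group (assumed definition of 'core')."""
    sets = [kmers(g, k) for g in genomes]
    return set.intersection(*sets) if sets else set()


def core_stats(genomes, k):
    """Return (distinct core k-mers, isolates, core k-mers per isolate)."""
    n_isolates = len(genomes)
    n_core = len(core_kmers(genomes, k))
    ratio = round(n_core / n_isolates, 2) if n_isolates else 0.0
    return n_core, n_isolates, ratio


# Three toy "isolates" of one hypothetical genus.
toy_genus = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
print(core_kmers(toy_genus, 4))
print(core_stats(toy_genus, 4))
```

Because the core set is an intersection, it can only shrink as isolates are added, which is worth keeping in mind when comparing genera with very different numbers of isolates.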


Supplementary Table S7: COG category profiles for 16 phyla.

Phyla Genera Nbr of Isolates A B C D E F G H I J K L M N O P Q R S T U V W Y Z

Actinobacteria Clavibacter 150 2 0 46 11 94 40 96 52 34 39 41 54 51 1 40 77 24 123 102 34 21 13 1 0 0

Actinobacteria Nocardia 115 1 0 69 11 87 29 50 54 33 45 30 43 36 0 35 41 19 78 50 22 15 8 0 0 0

Actinobacteria Actinoplanes 114 0 0 45 8 38 20 37 22 10 22 22 26 17 0 19 28 6 38 19 20 7 5 0 0 0

Actinobacteria Amycolatopsis 102 1 0 91 13 124 49 94 66 40 38 58 61 59 1 51 74 35 157 110 36 15 15 0 0 0

Actinobacteria Gordonia 61 0 0 36 7 49 13 28 31 27 29 17 24 19 0 26 23 9 35 26 11 5 5 0 0 0

Actinobacteria Rhodococcus 35 0 0 10 3 9 3 6 1 7 6 6 6 4 0 7 2 3 4 1 3 1 0 0 0 0

Actinobacteria Frankia 33 0 0 3 1 1 1 6 1 4 2 3 3 5 0 1 2 4 5 4 4 1 2 0 0 0

Actinobacteria Arthrobacter 25 0 0 6 0 9 0 4 2 1 6 1 1 0 0 3 8 1 2 1 1 0 2 0 0 0

Actinobacteria Streptomyces 16 0 0 0 0 7 0 1 1 0 0 0 0 0 0 0 6 0 0 0 0 0 3 0 0 0

Actinobacteria Gardnerella 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0

Alphaproteobacteria Azospirillum 107 0 0 68 15 76 12 47 28 20 17 18 38 42 12 33 45 12 54 49 52 20 14 0 0 0

Alphaproteobacteria Acetobacter 82 1 0 106 20 135 52 73 98 40 54 44 79 88 1 77 81 20 143 157 30 29 16 1 0 0

Alphaproteobacteria Phaeobacter 82 0 0 10 3 44 3 10 15 10 7 15 20 8 8 13 16 4 15 7 7 5 2 0 0 0

Alphaproteobacteria Caulobacter 73 1 0 36 6 38 6 19 21 25 21 10 20 22 4 22 19 14 31 28 16 16 6 0 0 0

Alphaproteobacteria Mesorhizobium 71 0 0 14 2 30 4 14 14 5 5 9 13 12 0 12 23 6 30 13 11 2 6 0 0 0

Alphaproteobacteria Sinorhizobium 64 0 0 53 5 67 16 45 27 17 19 17 18 12 8 24 34 7 66 36 18 19 7 0 0 0

Alphaproteobacteria Bradyrhizobium 62 0 0 2 0 13 0 7 5 1 1 2 12 4 0 7 17 1 17 6 15 2 6 0 0 0

Alphaproteobacteria Oligotropha 57 0 0 4 0 14 1 9 3 1 3 6 20 7 0 4 22 2 16 9 11 3 5 0 0 0

Alphaproteobacteria Rhodobacter 55 0 0 32 4 27 13 26 16 11 7 6 14 12 1 10 18 4 15 7 2 4 5 0 0 0

Alphaproteobacteria Sphingobium 50 0 0 22 2 13 4 6 8 12 8 1 7 5 0 22 8 1 7 5 9 6 3 0 0 0

Alphaproteobacteria Rhodopseudomonas 43 0 0 2 1 8 1 1 3 2 2 0 7 4 0 4 8 1 11 5 7 2 5 0 0 0

Alphaproteobacteria Rhizobium 39 0 0 11 2 13 1 4 7 1 4 2 4 0 0 8 11 1 9 0 4 0 8 0 0 0

Alphaproteobacteria Methylobacterium 33 0 0 12 2 5 2 6 3 0 3 0 1 7 0 5 4 1 7 4 9 2 2 0 0 0

Alphaproteobacteria Agrobacterium 30 0 0 17 1 15 4 3 4 1 4 1 7 2 0 5 11 1 9 1 3 3 5 0 0 0

Alphaproteobacteria Rhodospirillum 29 0 0 3 2 6 5 1 1 4 0 0 2 1 0 0 9 0 2 1 5 0 2 0 0 0

Alphaproteobacteria Brucella 26 0 0 6 0 10 2 14 6 5 3 7 12 6 0 6 11 0 15 9 3 1 5 0 0 0

Alphaproteobacteria Gluconacetobacter 21 0 0 16 1 8 3 12 11 7 7 8 9 7 0 5 11 2 11 8 3 8 3 0 0 0

Alphaproteobacteria Hyphomicrobium 6 0 0 4 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

Alphaproteobacteria Ehrlichia 2 0 0 1 0 1 0 2 4 1 1 1 0 0 0 2 1 0 0 1 0 2 0 0 0 0

Alphaproteobacteria Wolbachia 2 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Alphaproteobacteria Bartonella 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Alphaproteobacteria Zymomonas 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Bacteroidetes Porphyromonas 6 0 0 0 0 1 1 0 0 0 1 1 0 0 0 4 0 0 0 2 0 0 0 0 0 0

Bacteroidetes Flavobacterium 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0

Bacteroidetes Riemerella 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Betaproteobacteria Variovorax 184 1 0 126 20 163 48 109 78 52 72 54 82 92 5 73 112 41 230 204 59 50 19 0 0 0

Betaproteobacteria Achromobacter 115 0 0 58 6 81 22 42 34 35 30 35 30 52 20 48 62 28 100 97 27 33 13 0 0 0

Betaproteobacteria Cupriavidus 110 1 0 126 16 130 44 53 77 41 80 36 56 67 12 57 77 23 134 111 43 41 16 0 0 0

Betaproteobacteria Acidovorax 38 0 0 34 4 40 11 9 13 11 12 7 14 7 5 10 18 8 19 17 14 18 4 0 0 0

Betaproteobacteria Neisseria 25 0 0 2 1 4 0 3 4 1 2 1 14 11 0 3 8 3 8 9 1 4 3 0 0 0

Betaproteobacteria Bordetella 14 0 0 0 0 3 0 1 1 0 1 0 0 0 0 1 6 0 6 0 0 0 1 0 0 0

Betaproteobacteria Ralstonia 6 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0

Betaproteobacteria Taylorella 6 0 0 14 2 13 1 3 3 1 8 3 5 7 0 4 8 2 10 13 5 2 2 0 0 0

Betaproteobacteria Nitrosomonas 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Chloroflexi Chloroflexus 12 0 0 35 3 31 4 36 19 10 6 11 17 13 0 16 19 8 38 15 10 7 5 0 0 0

Chloroflexi Dehalococcoides 7 0 1 42 7 63 35 19 40 17 28 24 43 14 0 24 21 5 57 37 9 15 2 0 0 0

Crenarchaeota Desulfurococcus 2 0 0 4 0 7 2 3 1 0 5 4 7 0 0 4 2 0 15 3 2 1 0 0 0 0

Cyanobacteria Synechocystis 223 0 0 67 6 76 29 52 43 25 29 17 34 36 0 33 50 13 63 51 24 16 9 0 0 0

Cyanobacteria Anabaena 8 0 0 14 3 11 3 9 9 2 4 3 5 5 0 3 12 6 17 11 3 1 2 0 0 0

Cyanobacteria Nostoc 6 0 0 3 0 3 1 3 2 1 0 2 1 0 0 0 5 1 5 0 0 0 0 0 0 0

Deinococcus-Thermus Meiothermus 36 0 0 15 4 14 5 10 7 7 6 5 17 3 1 7 18 5 22 14 6 6 5 0 0 0

Deinococcus-Thermus Thermus 13 0 0 2 2 4 0 3 1 1 1 1 3 1 0 0 2 1 2 0 0 0 2 0 0 0

Deltaproteobacteria Myxococcus 89 1 0 45 12 60 31 34 42 35 33 28 61 67 3 43 49 15 105 65 50 56 9 0 0 1

Deltaproteobacteria Anaeromyxobacter 74 0 0 77 14 62 27 44 35 32 29 28 53 49 0 48 44 15 91 58 44 36 10 0 0 0

Deltaproteobacteria Bdellovibrio 3 0 0 9 0 1 0 2 0 0 2 1 4 1 0 0 1 0 3 0 2 0 1 0 0 0

Epsilonproteobacteria Arcobacter 4 0 0 13 0 10 5 3 11 3 9 7 7 4 2 6 2 0 7 5 6 4 0 0 0 0

Euryarchaeota Haloarcula 91 0 2 85 11 112 44 55 91 36 55 45 67 24 5 52 74 18 196 174 35 20 13 0 0 0

Euryarchaeota Methanosarcina 3 0 0 12 0 9 5 3 20 1 9 8 16 2 0 5 2 3 13 15 4 1 1 0 0 0

Euryarchaeota Methanobrevibacter 1 0 0 1 0 0 1 0 3 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0

Euryarchaeota Methanocaldococcus 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Euryarchaeota Methanocella 1 0 0 4 0 2 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0

Euryarchaeota Methanosaeta 1 0 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Euryarchaeota Pyrococcus 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Firmicutes Thermoanaerobacterium 15 0 0 31 12 47 18 43 20 8 22 28 43 35 8 16 17 4 60 52 21 20 4 0 0 0

Firmicutes Carnobacterium 13 0 0 2 3 3 2 9 1 1 10 4 3 0 0 4 1 1 4 3 0 1 0 0 0 0

Firmicutes Roseburia 13 0 0 9 1 21 8 33 2 8 8 7 16 8 6 11 9 0 16 10 12 4 5 0 0 0

Firmicutes Desulfitobacterium 8 0 0 38 4 25 9 8 5 6 19 15 13 15 4 7 11 2 27 26 14 8 2 0 0 0

Firmicutes Pediococcus 8 0 0 2 0 4 1 9 0 3 4 4 4 2 0 2 2 0 6 4 1 1 1 0 0 0

Firmicutes Lactococcus 7 0 0 5 0 0 1 8 0 1 5 3 2 1 0 1 1 0 2 1 0 2 0 0 0 0

Firmicutes Desulfosporosinus 6 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 2 0 0 0 0 0 0

Firmicutes Geobacillus 4 0 0 2 1 3 0 0 2 0 1 2 1 3 0 0 0 0 0 1 1 1 0 0 0 0

Firmicutes Caldicellulosiruptor 2 0 0 0 0 2 0 0 0 0 0 2 8 1 0 0 0 0 0 1 0 0 1 0 0 0

Firmicutes Exiguobacterium 2 0 0 4 0 0 1 8 0 0 2 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0

Firmicutes Thermoanaerobacter 2 0 0 1 0 1 0 0 1 0 0 0 5 2 0 0 0 0 1 1 0 0 0 0 0 0

Gammaproteobacteria Azotobacter 284 2 0 219 40 225 79 159 136 81 108 90 150 140 26 127 181 62 372 442 105 118 33 0 0 1

Gammaproteobacteria Stenotrophomonas 174 2 0 183 36 203 78 109 112 80 96 77 122 145 25 118 138 43 264 283 89 103 26 1 0 0

Gammaproteobacteria Edwardsiella 129 1 0 170 35 171 84 142 118 56 182 68 125 137 19 87 124 22 213 239 69 73 19 0 0 0

Gammaproteobacteria Rahnella 89 1 0 180 42 234 88 206 136 63 200 90 130 145 22 111 167 30 293 296 79 79 25 1 0 0

Gammaproteobacteria Shigella 68 0 0 74 9 101 49 66 62 32 46 43 68 87 1 38 81 25 107 128 49 23 21 1 0 0

Gammaproteobacteria Cronobacter 52 0 0 4 0 7 0 9 3 2 4 3 2 6 1 1 15 3 9 7 6 2 4 0 0 0

Gammaproteobacteria Pectobacterium 50 1 0 49 14 77 31 84 42 19 116 46 35 49 0 40 55 16 95 88 22 25 7 0 0 0

Gammaproteobacteria Xylella 46 1 0 67 17 86 31 43 43 30 37 32 46 66 0 54 40 12 90 84 25 40 15 1 0 0

Gammaproteobacteria Aeromonas 32 0 0 24 3 30 11 23 8 3 6 21 12 12 7 15 29 1 30 14 26 21 7 0 0 0

Gammaproteobacteria Erwinia 32 0 0 22 3 18 7 12 8 0 78 13 17 4 0 6 15 1 4 13 9 12 1 0 0 0

Gammaproteobacteria Thioalkalivibrio 29 0 0 5 2 5 3 4 2 1 5 0 2 2 0 1 5 2 1 1 0 2 2 0 0 1

Gammaproteobacteria Pasteurella 28 0 0 30 6 20 7 31 8 5 35 10 7 10 0 7 14 4 33 23 6 6 1 1 0 0

Gammaproteobacteria Coxiella 20 0 0 4 0 13 4 8 7 6 6 2 19 8 0 4 8 1 15 15 0 9 6 0 0 0

Gammaproteobacteria Klebsiella 20 0 0 3 0 1 1 5 1 0 5 1 0 2 0 2 2 0 5 0 2 0 0 0 0 0

Gammaproteobacteria Aggregatibacter 17 0 0 21 8 22 7 14 16 4 13 6 20 21 0 8 9 1 25 15 3 14 3 0 0 0

Gammaproteobacteria Dickeya 11 0 0 1 0 2 2 3 0 0 6 0 0 0 0 0 2 0 0 3 3 0 0 0 0 0

Gammaproteobacteria Alteromonas 10 0 0 27 3 8 7 1 4 5 24 8 8 1 0 9 0 0 17 8 4 4 0 0 0 0

Gammaproteobacteria Marinobacter 10 0 0 9 1 6 3 2 4 3 7 3 4 2 0 2 3 1 6 0 2 3 0 0 0 0

Gammaproteobacteria Pseudoalteromonas 10 0 0 5 0 0 0 0 1 0 10 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0

Gammaproteobacteria Salmonella 10 0 0 0 1 1 1 2 2 0 3 1 2 0 0 3 0 0 1 0 1 0 2 0 0 0

Gammaproteobacteria Yersinia 10 0 0 0 0 1 1 2 1 0 2 1 0 0 0 0 2 0 1 1 2 0 0 0 0 0

Gammaproteobacteria Psychrobacter 7 0 0 11 1 4 4 1 4 3 6 4 0 2 0 5 1 1 0 3 1 0 0 0 0 0

Gammaproteobacteria Actinobacillus 6 0 0 3 0 6 1 2 1 2 2 1 3 0 0 2 0 0 2 0 0 0 0 0 0 0

Gammaproteobacteria Glaciecola 6 0 0 2 0 0 0 0 1 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Gammaproteobacteria Mannheimia 6 0 0 0 0 0 2 0 0 0 5 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0

Gammaproteobacteria Nitrosococcus 6 0 0 6 2 4 2 7 6 0 3 2 11 7 0 4 6 0 3 2 1 3 0 0 0 0

Gammaproteobacteria Pantoea 6 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0

Gammaproteobacteria Xanthomonas 6 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 4 0 0 0 0

Gammaproteobacteria Marinomonas 3 0 0 6 0 1 0 1 1 1 0 0 1 0 0 1 2 0 1 0 1 2 0 0 0 0

Gammaproteobacteria Acidithiobacillus 2 0 0 4 1 3 3 1 0 1 3 3 6 1 0 0 2 0 1 1 2 1 0 0 0 0

Gammaproteobacteria Enterobacter 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Gammaproteobacteria Escherichia 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Gammaproteobacteria Francisella 1 0 0 0 0 2 0 1 2 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0

Gammaproteobacteria Serratia 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Gammaproteobacteria Shewanella 1 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Spirochaetes Brachyspira 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0

Tenericutes Ureaplasma 34 0 0 17 7 21 19 20 12 8 27 15 44 9 0 18 23 1 40 28 5 11 7 0 0 0

Thaumarchaeota Nitrosopumilus 2 0 1 33 3 24 10 9 27 6 12 11 17 5 0 15 12 2 21 20 1 5 2 0 0 0


Supplementary Table S8: Relationship between $D_2^S$ and the fraction of shared 25-mers.

$D_2^S$ distance  Fraction of shared 25-mers (%)
0                 98.98
1                 37.24
2                 13.6
3                 4.91
4                 1.88
5                 0.7
6                 0.26
7                 0.09
8                 0.02
9                 0.01
10                0.0062
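The values in Supplementary Table S8 fall by a roughly constant factor per unit of distance, which suggests an approximately exponential relationship. The sketch below checks this by fitting a straight line to the natural logarithm of the shared fraction with ordinary least squares. This is an observation made from the tabulated numbers only, not a relationship stated in the table; the function `log_linear_fit` is hypothetical, and the small percentages at distances 8 to 10 are reported to few significant digits, so the fit is indicative at best.

```python
import math

# (distance, percentage of shared 25-mers), copied from Supplementary Table S8.
TABLE_S8 = [
    (0, 98.98), (1, 37.24), (2, 13.6), (3, 4.91), (4, 1.88), (5, 0.7),
    (6, 0.26), (7, 0.09), (8, 0.02), (9, 0.01), (10, 0.0062),
]


def log_linear_fit(rows):
    """Least-squares fit of ln(fraction) = intercept + slope * distance."""
    xs = [d for d, _ in rows]
    ys = [math.log(pct / 100.0) for _, pct in rows]  # percentages to fractions
    n = len(rows)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = log_linear_fit(TABLE_S8)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A slope close to -1 with an intercept close to 0 would correspond to a shared fraction of about exp(-d) at distance d; for instance exp(-1) is about 36.8%, near the tabulated 37.24%.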
