SCALABLE ALIGNMENT-FREE APPROACHES IN MICROBIAL PHYLOGENOMICS
Guillaume Bernard
Bachelor of Cellular Biology & Physiology and Master in Bioinformatics
A thesis submitted for the degree of Doctor of Philosophy at
The University of Queensland in 2017
Institute for Molecular Bioscience
Abstract
In the 1970s, Carl Woese and colleagues discovered the third domain of life by comparing oligonucleotide catalogs of 16S/18S rRNAs. Four decades later, phylogenetic studies are mostly based on multiple sequence alignment (MSA) approaches. However, genome evolution in microbes involves highly dynamic molecular mechanisms, including genome rearrangement and lateral genetic transfer (LGT). These mechanisms can violate the implicit assumption of full-length contiguity in MSA. Furthermore, commonly used MSA-based approaches can necessitate the use of heuristic methods, e.g. Bayesian inference, in reconstructing phylogenies, and these may not scale to the quantity of existing and forthcoming genome data. In recent years, alignment-free (AF) methods have been developed as an alternative strategy to infer evolutionary relatedness based on shared subsequences of fixed length, known as k-mers, in a manner similar to Woese’s preliminary work. In this thesis, I aimed to study the complex evolution of microbial genomes by developing novel AF approaches and systematically assessing the potential of AF methods for phylogenetic inference. This could provide new insight into microbial evolution and change the way we do phylogenomics, i.e. potentially lead to the development of “next-generation phylogenomics”.
The thesis starts with a brief overview of the diversity of microbial life, and the difficulties in understanding microbial evolution due to complex phenomena such as LGT or rearrangement. I explain how phylogenomic approaches can be used to understand microbial evolution, and describe distinct approaches based on MSA and AF.
The second chapter is a literature review of the conceptual foundations of alignment-free approaches for the inference of phylogenetic relationships of genome sequences. I discuss the limitations of MSA-based approaches, introduce the concept of k-mers, present in detail the different families of alignment-free approaches and describe their applications to infer vertical and lateral phylogenetic signal among microbial genomes.
The three result chapters are presented in the form of research papers, each with its own introduction, methods, results and discussion.
In the first research chapter, I examined the performance of AF approaches in recovering accurate phylogenies of bacterial protein and nucleotide sequences simulated under diverse evolutionary scenarios. I implemented an AF approach to infer phylogenies and compared the robustness of a class of AF methods, the D2 statistics, with an MSA-based approach against among-site rate heterogeneity, compositional biases, genetic rearrangements, insertions/deletions, sequence divergence and sequence truncation. I also assessed the scalability of these methods on simulated and empirical data. This work demonstrated that, compared to an MSA approach, AF methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. The AF methods were found to be accurate, scalable and computationally efficient.
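Word-count AF statistics of the kind assessed in this chapter compare two sequences through their k-mer counts. The following is a minimal sketch, assuming the basic D2 form (the sum, over shared k-mers, of the product of their counts in each sequence). The function names and toy sequences are illustrative assumptions and are not taken from the thesis, which may use normalised variants and a further transformation into distances.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_x: str, seq_y: str, k: int) -> int:
    """Basic D2: sum over shared k-mers of the product of their counts."""
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    return sum(n * counts_y[w] for w, n in counts_x.items() if w in counts_y)


# Toy example (hypothetical data): the second sequence is the first one
# with its two halves swapped, mimicking a simple rearrangement.
original = "AAACCCGGGTTT"
rearranged = "GGGTTTAAACCC"

print(d2(original, original, 3))    # self-comparison
print(d2(original, rearranged, 3))  # comparison after rearrangement
```

In this toy case only the k-mers spanning the breakpoint are lost, so most of the shared-word signal survives the rearrangement. This is the intuition behind the robustness to rearrangement reported above.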
In the second research chapter, I systematically assessed the sensitivity and scalability of nine AF methods to genome-scale evolutionary events, including sequence divergence, LGT and rearrangement. The methods selected represent the two families of AF methods, those based on word counts (with exact or inexact k-mers) and those based on match lengths (with or without mismatches). I found that most AF methods are robust against rearrangement and a moderate amount of LGT, and I identified optimal parameters. I also examined the scalability of these methods at genome scale, and found that while remaining fast, their scalability differs between the two families. I also introduced a new application of the jackknife technique to provide node-support values to phylogenies inferred by AF approaches, and showed that these values are biologically meaningful.
In the third results chapter, I implemented an AF approach (based on the D2 statistic) to infer phylogenomic networks for a large dataset of complete genomes of Bacteria and Archaea. I reconstructed a phylogenomic network of microbial life using 2785 completely sequenced bacterial and archaeal genomes, and systematically assessed the impact of ribosomal RNA and plasmid sequences in this network. By implementing and varying a distance threshold, I captured changes in the network structure, e.g. cliques, that reflect the evolutionary dynamics of microbial genomes. I linked the implicated k-mers to annotated genomic regions (thus functions) using a database approach, and defined the term core k-mers. These findings indicate that AF phylogenomics is not limited to tree inference, but can also provide new insight into microbial evolution by combining network analysis and the use of a relational k-mer database in a scalable manner.
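The idea of varying a distance threshold to reveal network structure can be sketched as follows. This is a minimal illustration under stated assumptions: the genome labels, the pairwise distances and all function names are hypothetical, and the thesis's actual network construction may differ.

```python
from itertools import combinations


def threshold_network(dist, threshold):
    """Keep an edge for every genome pair whose distance is <= threshold."""
    return {pair for pair, d in dist.items() if d <= threshold}


def connected_components(nodes, edges):
    """Group nodes into connected components with a depth-first search."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(neighbours[n] - comp)
        seen |= comp
        components.append(comp)
    return components


def is_clique(members, edges):
    """True if every pair of members is directly connected."""
    return all((a, b) in edges or (b, a) in edges
               for a, b in combinations(members, 2))


# Hypothetical pairwise AF distances between four genomes.
genomes = ["A", "B", "C", "D"]
dist = {("A", "B"): 0.10, ("A", "C"): 0.20, ("B", "C"): 0.15,
        ("A", "D"): 0.90, ("B", "D"): 0.80, ("C", "D"): 0.85}

for t in (0.05, 0.12, 0.30, 1.00):
    edges = threshold_network(dist, t)
    print(t, len(edges), connected_components(genomes, edges))
```

Relaxing the threshold merges isolated genomes into progressively larger groups. In this toy example, A, B and C form a clique at a threshold of 0.30 while D joins only at a much larger distance.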
Declaration by author
This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.
I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my research higher degree candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.
I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.
I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis.
Publications during candidature
G. Bernard, Paul Greenfield, M. A. Ragan and C. X. Chan. K-mer similarity, networks of microbial genomes and taxonomic rank. bioRxiv 125237, DOI: 10.1101/125237, 2017.
G. Bernard, M. A. Ragan and C. X. Chan. Recapitulating phylogenies using k-mers: from trees to networks. F1000Research 5, 2789, DOI: 10.12688/f1000research.10225.2, 2016.
G. Bernard, C. X. Chan and M. A. Ragan. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer, Scientific Reports, 6:28970, DOI: 10.1038/srep28970, 2016.
C. X. Chan, G. Bernard, O. Poirion, J. M. Hogan, and M. A. Ragan. Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports, 4:6504, DOI: 10.1038/srep06504, 2014.
M. A. Ragan, G. Bernard and C. X. Chan. Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra. RNA Biology 11(3):176-185, DOI: 10.4161/rna.27505, 2014.
Publications included in this thesis
M. A. Ragan, G. Bernard and C. X. Chan. Molecular phylogenetics before sequences: Oligonucleotide catalogs as k-mer spectra. RNA Biology 11(3):176-185, 2014 - incorporated as part of Chapter 3.
Contributor | Statement of contribution
Author Guillaume Bernard (Candidate) | Designed experiments (80%); Wrote and edited paper (35%); Figures and tables (100%)
Author Cheong Xin Chan | Designed experiments (10%); Wrote and edited paper (10%)
Author Mark A. Ragan | Designed experiments (20%); Wrote the paper (55%)
C. X. Chan, G. Bernard, O. Poirion, J. M. Hogan, and M. A. Ragan. Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports, 4:6504, 2014 - incorporated as part of Chapter 3.
Contributor | Statement of contribution
Author Guillaume Bernard (Candidate) | Designed experiments (60%); Wrote and edited paper (30%); Figures and tables (50%)
Author Cheong Xin Chan | Designed experiments (20%); Wrote the paper (55%); Figures and tables (50%)
Author Mark A. Ragan | Designed experiments (20%); Wrote and edited paper (10%)
Author James M. Hogan | Designed experiments (5%); Wrote and edited paper (5%)
Author Olivier Poirion | Designed experiments (5%)
G. Bernard, C. X. Chan and M. A. Ragan. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer, Scientific Reports, 6:28970, 2016 - incorporated as Chapter 4.
Contributor | Statement of contribution
Author Guillaume Bernard (Candidate) | Designed experiments (80%); Wrote the paper (60%); Figures and tables (100%)
Author Cheong Xin Chan | Designed experiments (10%); Wrote and edited paper (20%)
Author Mark A. Ragan | Designed experiments (10%); Wrote and edited paper (20%)
G. Bernard, C. X. Chan and M. A. Ragan. Recapitulating phylogenies using k-mers: from trees to networks. F1000Research 5, 2789, 2016 - incorporated as part of Chapter 5.
Contributor | Statement of contribution
Author Guillaume Bernard (Candidate) | Designed experiments (100%); Wrote the paper (60%); Figures and tables (100%)
Author Cheong Xin Chan | Wrote and edited paper (30%)
Author Mark A. Ragan | Wrote and edited paper (10%)
G. Bernard, Paul Greenfield, M. A. Ragan and C. X. Chan. K-mer similarity, networks of microbial genomes and taxonomic rank. bioRxiv 125237, 2017 - incorporated as part of Chapter 5.
Contributor | Statement of contribution
Author Guillaume Bernard (Candidate) | Designed experiments (70%); Wrote the paper (50%); Figures and tables (100%)
Author Cheong Xin Chan | Wrote and edited paper (20%)
Author Mark A. Ragan | Wrote and edited paper (20%)
Author Paul Greenfield | Designed experiments (30%); Wrote and edited paper (10%)
Contributions by others to the thesis
No contributions by others.
Statement of parts of the thesis submitted to qualify for the award of another degree
None.
Acknowledgements
First and foremost, I would like to thank my Principal Supervisor, Prof. Mark A. Ragan, for giving me the opportunity to do my PhD under his supervision. I could not have wished for a better supervisor. Despite your busy schedule, you were always available to talk about my project. You are the most brilliant mind that I know, and it was an honor to be your student.