Cophylogenetic Analysis of Dated Trees
Total Page:16
File Type:pdf, Size:1020Kb
COPHYLOGENETIC ANALYSIS OF DATED TREES S I O D T E ·M A RE E ·MUT N M S· E EAD A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in the School of Information Technologies at The University of Sydney Benjamin Drinkwater August 2016 c Copyright by Benjamin Drinkwater 2016 All Rights Reserved ii Statement of Originality This is to certify that to the best of my knowledge, the content of this thesis is my own work. This thesis has not been submitted for any degree or other purposes. I certify that the intellectual content of this thesis is the product of my own work and that all the assistance received in preparing this thesis and sources have been acknowledged. Benjamin Drinkwater iii Abstract Parasites and the associations they form with their hosts is an important area of research due to the associated health risks which parasites pose to the human population. The associations parasites form with their hosts are responsible for a number of the worst emerging diseases impacting global health today, including Ebola, HIV, and malaria. Macro-scale coevolutionary research aims to analyse these associations to provide fur- ther insights into these deadly diseases. This approach, first considered by Fahren- holz in 1913, has been applied to hundreds of coevolutionary systems and remains the most robust means to infer the underlying relationships which form between coevolving species. While reconciling the coevolutionary relationships between a pair of evolution- ary systems is NP-Hard, it has been shown that if dating information exists there is a polynomial solution. These solutions however are computationally expensive, and are quickly becoming infeasible due to the rapid growth of phylogenetic data. If the rate of growth contin- ues in line with the last three decades, the current means for analysing dated systems will become computationally infeasible. Within this thesis a collection of algorithms are introduced which aim to address this problem. This includes the introduction of the most efficient solution for analysing dated coevolutionary systems optimally, along with two linear time heuristics which may be applied where traditional algorithms are no longer feasible, while still offering a high degree of accuracy (> 91%). Finally, this work integrates these incremental results into a single model which is able to handle widespread parasitism, the case where parasites infect multiple hosts. This proposed model reconciles two competing theories of widespread parasitism, while also provid- ing an accuracy improvement of 21%, one of the largest single improvements provided in this field to date. As such, the set of algorithms introduced within this thesis of- fers another step toward a unified coevolutionary analysis framework, consistent with Fahrenholz original coevolutionary analysis model. iv Acknowledgements I would like to acknowledge the support provided to me as part of my Australian Post- graduate Award. This support has made my PhD possible, as I would not have been able to undertake my PhD candidature in the topic of my choosing without this support. I would also like to acknowledge the William and Catherine Mcllrath Scholarship which allowed me to travel to Harvey Mudd College, Claremont, USA, during my Third Year as a PhD Candidate. During my time at Harvey Mudd I was hosted by Ran Libeskind-Hadas which led to many fruitful discussions about our common research topic which has greatly benefited the work presented herein. I would also like to thank his family for making me feel at home during my stay, even though I was half way around the world. Continuing the theme of funding I would also like to acknowledge the support pro- vided to me through the postgraduate research travel scheme (PRTS). I have been fortu- nate enough to qualify for assistance through this program to attend three international conferences in New Zealand, Taiwan and the United States which added so much to my PhD Candidature. A number of conversations, particularly during the coffee breaks, which have greatly benefit the work presented herein. I would also like to thank my first supervisor, Michael Charleston, now my auxiliary supervisor, having relocated to the University of Tasmania, for his helpful guidance, particularly when starting my PhD. Providing me the space to make mistakes and learn from them early has made all the difference to this final body of work and has led me to embrace the entire research process, both the highs and the lows. I would also like to thank my current supervisor Julian Mestre. Although not participating in this research directly knowing that I had someone close by, in the final stages of my time at the University of Sydney, was so important in completing this thesis. I am very appreciative for him taking me on midway through my PhD candidate, which I know required a significant amount of paperwork if nothing else and hope that I haven’t been v too much of a distraction for him. I would also like to thank Anastasios Viglas who while continuing to reject my requests to be an auxiliary supervisor throughout my PhD candidature provided many fruitful discussions on a number of topics and chapters explored in this thesis, along with always being there for an insightful conversation. The proofs presented within this work greatly benefited from our conversations when we needed a break from organising another algorithms tutorial, quiz or assignment. I would like to acknowledge the valuable help offered by Nicolas Wesike from the University of Leipzig, for his assistance in the generation of the data sets applied herein. Without his assistance a number of theoretical results inferred herein would have posed a much greater challenge to verify. I would also like to acknowledge the assistance provided by Ivan Cheung for offer- ing to read through my thesis. The conversations we have shared about my research over a cup of coffee or a game of dominion have been so useful in articulating my research and have found your perspective from a biochemistry background to be so complemen- tary to my software engineering background in describing the processes defined herein. Of course while thanking everyone around the halls of SIT would not be possible I have to thank the former and current members of Kowhai research group, particularly my PhD comrades in arms, Martin, Alex and Josh, and of course all the honourlings who have come and gone, to think I was one of you six years ago. These last few years would not have been half as enjoyable without the many entertaining conversations with all of you and off course the pub lunches at the Rose. Finally I wish to thank all my family, particularly my Mum, for putting up with me over the last few years. Especially for supporting me at times where it looked as if my PhD candidature might go on, forever, you were always there with encouragement and offered more assistance than I could have ever hoped for. Last but certainly not least I would like to thank my fiancee´ Eleni. Without her this journey would never had a beginning let alone the destination I find myself arriving at today. Thank you so much for always being there for me and I look forward to our next set of adventures together. vi List of Publications 2014 Drinkwater, Benjamin and Charleston, Michael A. (2014). An improved node • mapping algorithm for the cophylogeny reconstruction problem, Coevolution, Taylor & Francis, 2(1):1–17. Drinkwater, Benjamin and Charleston, Michael A. (2014). Introducing TreeCol- • lapse: a novel greedy algorithm to solve the cophylogeny reconstruction problem, BMC Bioinformatics, BioMed Central Ltd, 15(Suppl 16):S14. 2015 Drinkwater, Benjamin and Charleston, Michael A. (2015). A time and space • complexity reduction for coevolutionary analysis of trees generated under both a Yule and Uniform model. Computational Biology and Chemistry, Elsevier, 57:61–71. Drinkwater, Benjamin and Charleston, Michael A. (2015). A Sub-quadratic Time • and Space Complexity Solution for the Dated Tree Reconciliation Problem for Select Tree Topologies, Algorithms in Bioinformatics, Springer, 93–107. 2016 Drinkwater, Benjamin and Charleston, Michael A. (2016). RASCAL: A ran- • domised approach for coevolutionary analysis. Journal of Computational Biol- ogy, Mary Ann Liebert, Inc., 23(3):218-227. Drinkwater, Benjamin and Charleston, Michael A. (2016). Towards sub-quadratic • time and space complexity solutions for the dated tree reconciliation problem. Al- gorithms for Molecular Biology, BioMed Central Ltd, 11:15. vii Contents Statement of Originality iii Abstract iv Acknowledgementsv List of Publications vii List of Tables xiv List of Figures xxvi 1 Introduction1 1.1 Outline..................................1 1.2 Motivations................................3 1.3 Challenges.................................4 1.4 Aims....................................5 1.5 Contributions...............................6 2 Background8 2.1 Macro-Scale Coevolution.........................9 2.2 Early attempts to infer coevolutionary interactions............ 13 2.3 Handling incongruence.......................... 14 2.3.1 The four evolutionary events................... 15 2.4 Coevolutionary Inference Techniques.................. 21 2.4.1 Parsimony Analysis........................ 21 2.4.1.1 Brooks Parsimony Analysis: a means to deal with incongruence at last.................. 21 viii 2.4.1.2 Secondary Brooks Parsimony Analysis (SBPA)... 28 2.4.2 Cophylogeny Mapping...................... 30 2.4.2.1 Page’s Reconciled Tree Analysis........... 31 2.4.2.2 Maximising Codivergence Events........... 35 2.4.2.3 Welcome to the Jungle................. 36 2.4.3 Evaluating if the observed congruence is significant....... 40 2.4.3.1 An alternate approach................. 43 2.5 Contention between methods....................... 44 2.5.1 Dowling’s software comparison................. 45 2.5.2 Reception to Dowling’s software tests.............. 49 2.5.3 Brooks’ commentary on Dowling’s study............ 54 2.5.4 The end, at least for now . .................... 59 2.6 Computational Complexity Analysis................... 60 2.6.1 Approximation Algorithms...................