Bayesian Statistical Modelling of Genetic Sequence Evolution
Total Page:16
File Type:pdf, Size:1020Kb
Bayesian statistical modelling of genetic sequence evolution Konstantinos Angelis University College London Department of Genetics, Evolution and Environment 2015 Thesis submitted to the University College London for the degree of Doctor of Philosophy 2 I, Konstantinos Angelis, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. Specifically, four chapters of this thesis involve joint work which is specified below: Chapter 3. This chapter is a modification of the work of Angelis et al. (2014) and includes contributions by Dr. Mario dos Reis (MdR) and Prof. Ziheng Yang (ZY). Konstantinos Angelis (KA), MdR and ZY designed the study, build the model and interpreted the results. KA wrote the software and performed the simulation and real data analysis. Chapter 4. This chapter is a modification of the work of Angelis and dos Reis (2015) and includes contribution of MdR. KA and MdR designed the study and interpreted the results. KA performed the simulation and real data analyses. Chapter 5. This chapter includes contribution of MdR and ZY. KA, MdR and ZY designed the study. KA performed the analysis and interpreted the results. Chapter 6. This chapter is a modification of the work of dos Reis et al. (2015) and includes contribution of MdR, Yuttapong Thawornwattana (YT), Prof. Maximilian Telford, Prof. Philip Donoghue and ZY. YT prepared the data and set up the analysis pipeline. KA, YT and MdR performed the analysis. All researchers contributed to the interpretation of the results. This work is a highly collaborative effort of many people and integrates expertise from different fields such as zoology, palaeontology, molecular evolution and statistics. In addition to the above, during my PhD studies I published the following papers, which do not form part of the thesis: Paraskevis D, Angelis K, Magiorkinis G, Kostaki E, Ho S, Hatzakis A. 2015. Dating the origin of hepatitis B virus reveals higher substitution rate and viral adaptation to the branch leading to F/H genotypes. Mol Phylogenet Evol 93: 44-54. 3 Angelis K, Albert J, Mamais I (>10 authors), et al. 2014. Global dispersal pattern of HIV-1 CRF01_AE: A genetic trace of human mobility related to heterosexual activities centralized in South-East Asia. J Infect Dis 211:1735–1744. 4 Abstract Bayesian statistics has been at the heart of phylogenetic inference over the last decade, particularly after the development of powerful programs that implement efficient Markov chain Monte Carlo algorithms, allowing inference from multi-parametric problems in realistic time frames. In this thesis we develop and test Bayesian methods to analyse molecular sequence data to address important biological questions. First, we review some fundamental aspects of Bayesian inference and highlight current Bayesian applications in molecular evolution with particular focus in studying natural selection and estimating species divergence times. Then, we develop a new Bayesian method to estimate the nonsynonymous/synonymous rate ratio and evolutionary distance for pairwise sequence comparisons. The new method addresses weaknesses of previous counting and maximum- likelihood methods. It is also computationally efficient and thus suitable for genome-scale screening. Then, we explore the performance of existing Bayesian algorithms in estimating species divergence times. In particular, we study the impact of ancestral population size and incomplete lineage sorting on Bayesian estimates of species divergence times under the molecular clock, when those factors of molecular evolution are ignored by the inference model. The estimates can be highly biased, especially in the case of shallow phylogenies with large ancestral population sizes. Then, using computer simulations and real data analyses we study the effect of five commonly used partitioning strategies for divergence times estimation and show that the choice of the partitioning scheme is important in case of serious clock-violation with incorrect prior assumptions. Finally, a Bayesian molecular clock dating study is performed to estimate the timeline of animal evolution. The results indicate that the time estimates are highly variable, precluding the inference of a precise timescale of animal evolution based on the current data and methods. 5 Acknowledgements First and foremost, I would like to thank deeply my supervisor Ziheng Yang for his support, guidance, knowledge and useful advice he has given me throughout the three years of my PhD studies. He has been an inspiring supervisor and a perfect collaborator. I would also like to express my sincere gratitude to my exceptional collaborator Mario dos Reis for all his guidance and valuable advice he has given me. I thank him for our perfect collaboration and for the experience and precious knowledge he gave me. He will always be a good friend. Special thanks to Yuttapong Thawornwattana. He started the work of chapter 6 as part of his undergraduate project at UCL. I took on the analytical part of the project, and helped completed it in high collaboration with Maximilian Telford, Philip Donoghue, Mario dos Reis and Ziheng Yang. Many thanks to all people I have worked with including the members of our lab Jose Antonio Barba Montoya and Daniel Dalquen for useful scientific feedback and discussions concerning my projects. I would also like to particularly acknowledge the University College London and the Department of Genetics, Evolution and Environment for the financial support without which this work would not have been possible. Lastly, I would like to thank my wife Foteini Kosmidi for her constant support and patience all these years. 6 Contents Abstract ................................................................................................................................... 5 Acknowledgements ................................................................................................................ 6 Contents .................................................................................................................................. 7 List of Figures ....................................................................................................................... 10 List of Tables ........................................................................................................................ 12 Introduction .......................................................................................................................... 14 1 Bayesian Theory…………………. .............................................................................. 16 1.1 Bayesian inference ................................................................................................. 16 1.2 Some advantages of Bayesian statistics ................................................................. 18 1.3 The impact of the prior........................................................................................... 19 1.4 Markov chain Monte Carlo techniques .................................................................. 21 1.5 Statistical inference methods and simulations ....................................................... 23 2 Bayesian applications in molecular evolution............................................................ 27 2.1 Programs for Bayesian phylogenetic inference...................................................... 27 2.2 Estimating the mode and strength of natural selection on a protein ...................... 29 2.2.1 Natural selection ............................................................................................ 29 2.2.2 Codon models ................................................................................................ 35 2.2.3 Techniques to identify positive selection ....................................................... 36 2.3 Bayesian estimation of species divergence times .................................................. 40 2.3.1 The molecular clock ....................................................................................... 40 2.3.2 General framework and calculation of the likelihood .................................... 42 2.3.3 Priors on node ages ........................................................................................ 43 2.3.4 Rate drift models and prior on rates ............................................................... 45 2.3.5 The limits of molecular clock dating ............................................................. 46 3 Bayesian estimation of nonsynonymous/synonymous rate ratios for pairwise sequence comparisons……………….... .............................................................................. 51 3.1 Counting and maximum likelihood methods ......................................................... 51 7 Contents 3.2 The new Bayesian approach ................................................................................... 54 3.3 Simulations ............................................................................................................. 59 3.3.1 Performance of the Bayesian method in five different data sets .................... 59 3.3.2 Analysis of simulated data.............................................................................. 61 3.4 Analysis of mammalian and bacterial data ............................................................. 66 3.4.1 Analysis of the mammalian data set ............................................................... 66 3.4.2 Analysis