Species Tree Likelihood Computation Given SNP Data Using Ancestral Configurations
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Hang Fan, M.S.
Graduate Program in Statistics
The Ohio State University
2013

Dissertation Committee:
Professor Laura Kubatko, Advisor
Professor Bryan Carstens
Professor Radu Herbei

Copyright by Hang Fan 2013

Abstract

Inferring species trees from genetic data has been a challenge in the field of phylogenetics because of the high computational intensity involved. In the coalescent framework, this dissertation proposes an innovative method of estimating the likelihood of a species tree directly from Single Nucleotide Polymorphism (SNP) data under a given nucleotide substitution model. This method uses the idea of Ancestral Configurations (Wu, 2011) to avoid the computational burden of enumerating coalescent histories. Importance sampling is used in Monte Carlo integration to approximate the expectations in the computation, and the accuracy of the approximation is tested under different tree models. The SNP data are processed beforehand, which vastly improves the efficiency of the method. Gene tree sampling given the species tree under the coalescent model is employed to make the computation feasible for large trees. Further, the branch lengths of the species tree are optimized according to the computed species tree likelihood, which provides the likelihood of the species tree topology given the SNP data. For inference, this likelihood computation method is implemented in the stepwise addition algorithm to infer the maximum likelihood species tree in tree space given the SNP data, and simulations are conducted to test its performance.
We also apply this method to the problem of species delimitation for the purpose of validating proposed species delimitations given the SNP data, and we run simulations to check the validation outcomes under different scenarios, such as in the presence of subsampling in the SNP data.

Dedication

This document is dedicated to my parents.

Acknowledgments

Looking back in time, just like the coalescent theory used in my dissertation, I have greatly enjoyed my PhD study in statistics. Moreover, I have gained the best experience of my life, and it will continue to benefit my future career and life. I am writing these acknowledgments at the last moment before submission, because this brings my PhD odyssey to an end, and it is certainly hard to say goodbye to the many people I am grateful to and to my time as a PhD student.

Life is full of decisions. There are not many people or opportunities, however, that can make life different. My dear advisor, Professor Laura Kubatko, is certainly one of my life-changing angels. Thanks to her, I made my transition from biology to statistics, the field I have a passion for. She guided me to discover the joy of research and also showed me how to do things in a professional way. In personal life, she is also a great friend, and I am still amazed by how she has built a balance between career and life. PhD study is a tough and challenging exploration. Hence, it has been extremely important to me that Laura has always been supportive, encouraging, patient, and understanding. Words cannot say enough about my gratitude to her. I feel so blessed to have Laura as my advisor, and she is no doubt my role model.

I want to thank Professor Radu Herbei, who is my favorite teacher and also sits on my committee. I took ALL the statistics classes he taught, including the probability theory series, stochastic processes, large sample theory, stochastic differential equations, etc. I truly love learning from him.
He always explains theories intuitively, and he made the beauty of math and statistics visible and fascinating. Professor Bryan Carstens, another committee member, also helped me a lot in my research. I highly appreciate Bryan’s flexibility and kindness in discussing research with me. With his help, I gained a much better understanding of species delimitation. Bryan also gave me many valuable suggestions from the perspective of an empirical biologist.

I have so many wonderful memories of the Department of Statistics at OSU. Professor Mark Berliner’s class simply sparked my interest in statistics. I enjoyed the classes taught by Professor Peter Craigmile, Professor Doug Wolfe, Professor Tom Santner, Professor Angela Dean and Professor Dennis Pearl. I also want to thank our kind staff for offering me so much help. Outside school, I would like to thank my awesome friends for bringing sunshine and joy to my life. Special thanks to my friends Dr. Chia-Hua Lin and Dr. Agus Munoz-Garcia. Their friendship accompanied me through the ups and downs of my PhD years.

At last, I want to thank my parents, Xiaohong Zhang and Yamin Fan, for EVERYTHING. In my last year of PhD, when I was under great pressure from research and job hunting, they once told me, “no matter what happens, there are always two people on the other side of the world, thinking of you.” My parents are respectful and supportive of all my decisions, and their love drives me to pursue my dreams. Now the dream has come true. In the end, again, I want to thank the people here who have helped me. Thanks to you, I have become a Doctor of Philosophy and a person I’m proud of.

Vita

2003: Tanglai High School
2007: B.S. Biological Sciences, Tsinghua University
2010: M.S. Evolution, The Ohio State University
2010 to present: Graduate Teaching Associate, Department of Statistics, The Ohio State University

Publications

Helen Hang Fan and Laura S. Kubatko. 2011. Estimating species trees using approximate Bayesian computation. Molecular Phylogenetics and Evolution 59: 354-363.

Laura S. Kubatko and Helen Hang Fan. 2012. Reply to “Letter to the Editor on the article entitled ‘Estimating species trees using Approximate Bayesian Computation’ (Fan and Kubatko, Mol. Phylogenetics Evol. 59, 354-363).” Molecular Phylogenetics and Evolution 66(1): 438-439.

Zexuan Li, Yishu Huang, Jing Ge, Hang Fan, Xiaohong Zhou, Shentao Li, Mark Bartlam, Honghai Wang, and Zihe Rao. 2007. The crystal structure of MCAT from Mycobacterium tuberculosis reveals three new catalytic models. Journal of Molecular Biology 371(4): 1075-1083.

Fields of Study

Major Field: Statistics

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1: Background and Literature Review
  1.1 A definition of species tree and gene tree
  1.2 Models for SNP evolution along gene trees
    1.2.1 Nucleotide substitution models
    1.2.2 Computation of the likelihood of a gene tree given a species tree and SNP data
  1.3 Models for generating gene trees from a species tree
  1.4 The overall model for the evolution of SNP data given a species tree
  1.5 A survey of methods used for inferring species trees from genetic data
    1.5.1 Sequence-based methods
    1.5.2 Summary statistic methods
  1.6 Overview of this dissertation
Chapter 2: Likelihood Computation Using Ancestral Configurations
  2.1 Definition of AC
  2.2 Advantages of using AC
  2.3 A method for the likelihood computation of a species tree given SNP data using ACs
  2.4 SNP data processing and gene tree sampling to improve computational efficiency