Developing Computational Tools for Evolutionary Inferences in Polyploids
Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Paul David Blischak, B.Sc.

Graduate Program in Evolution, Ecology, and Organismal Biology

The Ohio State University
2018

Dissertation Committee:
Andrea D. Wolfe, Advisor
Bryan C. Carstens
John V. Freudenstein
Laura S. Kubatko

© Copyright by Paul David Blischak 2018

Abstract

Methods for generating genome-scale data sets are facilitating the inference of phylogenetic relationships in non-model taxa across the Tree of Life. However, rapid speciation and heterogeneous patterns of diversification make this task difficult when gene trees have conflicting histories (e.g., from incomplete lineage sorting). For plant species in particular, additional complications arise due to the intermixing of divergent lineages through hybridization and the subsequent occurrence of whole genome duplication (WGD; i.e., allopolyploidy). Investigations regarding the evolutionary history of recently formed polyploids and their diploid progenitors are difficult to conduct because of problems with resolving ambiguous genotypes in the polyploids as well as analyzing species with different ploidies. The focus of my dissertation has been to develop models and bioinformatic tools for analyzing high-throughput sequencing (HTS) data collected in non-model taxa of different ploidy levels to estimate phylogenetic relationships. I am applying these tools in the plant genus Penstemon (Plantaginaceae) to infer the relationships in two groups of closely related species containing diploids, tetraploids, and hexaploids.
The first chapter of my dissertation uses HTS data and a hierarchical Bayesian framework to estimate biallelic single nucleotide polymorphism (SNP) genotypes and allele frequencies in populations of any ploidy level (diploid or higher) assuming Hardy-Weinberg equilibrium. The model uses Markov chain Monte Carlo (MCMC) to integrate over the uncertainty in the estimated genotypes. I then assess the model’s accuracy using simulations and test it on a SNP data set in autotetraploid potato (Solanum tuberosum). Both of these tests demonstrate the usefulness of the model for parameter inference at different ploidy levels. The MCMC algorithm that is used for inference is implemented in the open-source R package polyfreqs.

The set of models in my second chapter builds on Chapter 1 in two important ways. First, I extend the Hardy-Weinberg equilibrium model to include inbreeding. Second, I directly address the hybrid nature of allopolyploid organisms by separately modeling the genomes of the two parental species. Using both simulations and empirical data sets from the literature (autopolyploid: Andropogon gerardii; allopolyploid: Betula pubescens, with its diploid parent B. pendula), I benchmark these methods against existing software (the Genome Analysis Toolkit) to demonstrate their effectiveness for estimating genotypes. These new models also use a different algorithm for inferring population parameters, the expectation-maximization algorithm, which I have implemented in the open-source software package ebg.

Chapter 3 uses ideas similar to those presented in the first two chapters, but focuses on inferring full haplotype sequences, rather than single SNP genotypes, for samples of arbitrary ploidy. The method is able to process paired-end HTS data collected using double-barcoded amplicon sequencing, and uses the program PURC to cluster sequencing reads into haplotypes. It then uses a multinomial likelihood to infer haplotypes while also accounting for sequencing error.
The pipeline is implemented in the software Fluidigm2PURC, and I demonstrate its use on a polyploid series from the genus Thalictrum (Ranunculaceae).

My final chapter uses nuclear amplicon sequencing to infer evolutionary relationships between two closely related groups in Penstemon: subsections Humiles and Proceri (Plantaginaceae). These two groups are known to hybridize and have documented cases of WGD events forming putative allotetraploids and allohexaploids. To estimate phylogeny in these two groups, I first use the methods described in Chapter 3 to determine haplotypes from paired-end HTS data for all diploid, tetraploid, and hexaploid individuals. I also develop a method for assessing the proportion of gene trees supporting a species-level quartet (quartet concordance factors; QCFs), which I use as input for estimating a species network using the program SNaQ. Phylogenies inferred using both species tree and network approaches recover subsections Humiles and Proceri as non-monophyletic. There is also strong evidence for hybridization within and between these two groups.

This is dedicated to my grandparents: Doris & Michael† Blischak and Mary† & Carl Firm

Acknowledgments

First and foremost, I would like to thank my advisor, Dr. Andrea Wolfe, for her guidance and support throughout the process of getting my Ph.D. I started working with Andi as an undergraduate math major, and she convinced me to go to grad school and to give biology a try. I will be forever grateful for this advice. My co-advisor for this undergrad research project with Andi was Dr. Laura Kubatko, who has continued to serve as a mentor and “unofficial” Ph.D. co-advisor, for which I am very thankful. Working with Andi and Laura has been an incredible experience, and I will miss getting to chat in their offices while mulling over the latest issues about why none of my research is working. I would also like to thank my other committee members, Dr. Bryan Carstens and Dr.
John Freudenstein, who provided invaluable insights into key areas of my thesis, and were always happy to discuss my research or to help troubleshoot.

I would like to thank all members of the Wolfe Pack, both past and present. When I first started in the lab, I was beyond clueless, but my senior lab mates, Dan Robarts and Aaron Wenzel, were an immense help, and continued to guide me through the early years of my graduate work. I’d also like to acknowledge my current lab mates, Ben Stone and Rosa Rodriguez, who are great friends and colleagues, and who have helped me bounce around ideas or problems that I was having with my research on numerous occasions.

Getting through this Ph.D. would not have been possible without my family. My parents, Maggi and Dave, have shown indefatigable love and support throughout all stages of my education, even when I had absolutely no idea what I wanted to do with my life. They are model human beings, and I hope to live up to the example that they have set for me. My siblings, John and Julianna, are equally badass. Not only are they great couch-fort builders, living-room soccer players, and backyard mat tumblers, they are incredible friends. I am also immeasurably fortunate to have an amazing partner, Makenzie Mabry, whose love and encouragement have sustained me through both the high and low points of my Ph.D. over the past couple of years.

I would like to thank everyone who helped me in the field and with obtaining collecting permits, including Mikel Stevens, Noel Holmgren, Karen and Steve Shelly, Carol Blackburn, Teresa Prendusi, Dale Reinhart, Maret Pajutee, and Steve Popovich.
I also thank the organizations that provided funding for my research: The Ohio State University (Distinguished University Fellowship), the National Science Foundation (Doctoral Dissertation Improvement Grant; DEB-1601096), the Society of Systematic Biologists (Graduate Student Research Award), and the American Society of Plant Taxonomists (Graduate Student Research Grant). Computational resources for my research were provided by the Ohio Supercomputer Center and the College of Arts and Sciences Unity Cluster at OSU.

I would also like to acknowledge Drs. Xin He, John Novembre, and Matthew Stephens, as well as their lab members, for allowing me to come visit and present my research at the University of Chicago, and thanks especially to my brother John for orchestrating this opportunity.

To all of the wonderful friends that I have made while working in EEOB, thank you all for the memories, the laughs, and the good times. I will miss you all, and I hope we’ll bump into each other as frequently as possible.

Vita

2008: Archbishop Hoban High School
2012: B.Sc. Mathematics, The Ohio State University
2012-2013, 2017-2018: Distinguished University Fellow, The Ohio State University
2013-2015: Graduate Teaching Associate, The Ohio State University
2015-2017: Graduate Research Associate, The Ohio State University

Publications

Research Publications

Blischak, P. D., M. Latvis, D. F. Morales-Briones, J. C. Johnson, V. S. Di Stilio, A. D. Wolfe, and D. C. Tank. Fluidigm2PURC: Automated Processing and Haplotype Inference for Double-Barcoded PCR Amplicons. Applications in Plant Sciences, 6:e1156, 2018.

Blischak, P. D., J. Chifman, A. D. Wolfe, and L. S. Kubatko. HyDe: a Python Package for Genome-Scale Hybridization Detection. Systematic Biology, doi:10.1093/sysbio/syy023, 2018.

Blischak, P. D., L. S. Kubatko, and A. D. Wolfe. SNP Genotyping and Parameter Estimation in Polyploids Using Low-Coverage Sequencing Data. Bioinformatics, 34:407–415, 2018.

Latvis, M., S.
J. Jacobs, S. M. E. Mortimer, M. Richards, P. D. Blischak, S. Mathews, and D. C. Tank. Primers for Castilleja and Their Utility Across Orobanchaceae: II. Single-Copy Nuclear Loci. Applications in Plant Sciences, 5:1700038, 2017.

Wolfe, A. D., T. Necamp, S. Fassnacht,