UNIVERSITY of CALIFORNIA SANTA CRUZ Bioinformatic Approaches To
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA SANTA CRUZ Bioinformatic Approaches to Admixture and Symbiosis A dissertation submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in BIOMOLECULAR ENGINEERING AND BIOINFORMATICS by Paloma Medina June 2021 The Dissertation of Paloma Medina is approved: ______________________________ Professor Russell Corbett-Detig, Chair ______________________________ Professor Richard Green ______________________________ Professor William Sullivan ________________________________ Quentin Williams Acting Vice Provost and Dean of Graduate Studies Copyright © by Paloma Medina 2021 Table of Contents List of Figures iv List of Tables vi Abstract vii Dedication viii Acknowledgements ix General Introduction 1 Chapter 1: Estimating the Timing of Multiple Admixture Pulses During Local Ancestry Inference 4 Abstract 5 Introduction 5 Methods 10 Model 17 Validation 26 Application 41 Conclusions 50 Chapter 2: Deep data mining reveals variable abundance and distribution of microbial reproductive manipulators within and among diverse host species 53 Abstract 54 Introduction 55 Results and Discussion 62 Acknowledgements 98 Chapter 3: Anopheles gambiae is not a natural host of Wolbachia 99 Abstract 100 Introduction 101 Methods 104 Results and Discussion 108 Conclusion 114 Synthesis and Future Directions 117 Bibliography 119 iii List of Figures Figure 1 Schematic of admixture models……………………......…………...……….8 Figure 2 Tract length distributions obtained using our tract length model approximation (solid black) and forward-in-time simulation (dashed red)...............................................................................................................27 Figure 3 Admixture time estimates for two-pulse population models with three distinct ancestral populations…………………………………………..…30 Figure 4 Two-pulse admixture models fitted using our framework to data generated under a single-pulse admixture model………………………………....….33 Figure 5 Two-pulse admixture models fitted using our framework to data generated under a two-pulse admixture model………………………...…………….35 Figure 6 Admixture model fitted for data consistent with admixed Drosophila populations…………………………………………………….....……….43 Figure 7 Single- and double-pulse models applied to real genotype data generated from sub-Saharan African populations of D. melanogaster……………………………………………………...……....45 Figure 8 Most likely admixture model fitted to real data generated from Gambella, Ethiopia (EA)..............................................................................................48 Figure 9. A schematic of the computational pipeline used to determine the reproductive manipulator infection status and endosymbiont titer of a sequencing run……………………………………………..……………...63 Figure 10. Phylogeny of Arthropoda orders tested and number of reproductive manipulator positive species within each order ………………………….69 Figure 11. Wolbachia titer among D. melanogaster and D. simulans computed from our (A) in silico titer estimate approach (y-axis log10 scaled) and (B-E) in vivo fluorescence assay in stage 10a oocyte cysts ………………...……..73 Figure 12. Titer variation across species infected with Wolbachia and between diverse reproductive manipulator clades ……………………………....……..…..76 iv Figure 13. Anopheles gambiae tagged sequence read libraries aligned to Wolbachia reference genome via BLASTN ...……………………………….……...109 Figure 14. Phylogenetic trees of Wolbachia MLST genes …………………….…...113 v List of Tables Table 1. Raw infection counts of Wolbachia, Spiroplasma, Rickettsia and Arsenophonus infecting arthropods and nematodes……………………….64 Table 2. Estimated infection frequencies and confidence intervals from our data for Wolbachia, Spiroplasma, Rickettsia and Arsenophonus infecting arthropods and nematodes……………………………………………………………..67 Table 3. Wolbachia global infection frequencies and confidence intervals generated for arthropod orders………………………………………………………..70 Table 4. Reference-based and non-reference based approaches to label the taxonomic group of nine samples with putative Wolbachia infection………………..111 vi Abstract Bioinformatic Approaches to Admixture and Symbiosis By Paloma Medina As DNA sequencing data becomes more prevalent, population genetics and evolutionary genomics are becoming increasingly more data driven, requiring the development of new tools to work with large datasets. In my doctoral research, I develop bioinformatic tools to study two important contributors to evolution: admixture and symbiosis. In my first chapter, I apply a coalescent theory model to Drosophila melanogaster populations and show evidence for African and European admixture in sub-Saharan populations of this species. In subsequent chapters, I develop a data mining approach to classify bacterial symbiont infections in publicly sequencing databases. Moreover, I investigate bacterial density within a wide range of host species and find evidence for symbiont induced host genome evolution. vii Dedication To my mom, your strength and love show me my humanity. Thank you for loving me and rooting for me. To my dad, your energy makes me feel like everything is possible. Thank you for seeing the best in me and always being there for me. To my sister, your wit is sharp. Thank you for inspiring me to be my best self and trusting me. To my brother, your kindness and humanitarian nature are so special. Thank you for reminding me of what is important. viii Acknowledgements Russ Corbett-Detig, thank you for helping me develop skills that will stay with me the rest of my life. You are smart, generous, and kind, and I am thankful for the hours of mentorship and guidance you have invested in me. I hope to make you proud with all that I do. I would also like to acknowledge my past mentors, Bryan Thines, Frank Chan, Felicity Jones, and the Professors at Scripps College who have all supported my research endeavors. Shelbi Russell, thank you for the conversation and support in my research and me as a person throughout my years here. You inspire me to keep on, and be the best scientist I can be. The Corbett-Detig Lab, thank you for making these past few years joyful and exciting. #NotTrash The Science and Justice Research Center and friends. Jenny Reardon thank you for offering space to explore my research from different perspectives, these spaces gave me a sense of belonging. To my Queer Ecology fellows, Elana Margo, Raed Rafei, Dennis Browe, Dorothy Santos, and more, your scholarship and organizing power was amazing to be in community with. To the feminist scholars outside of SJRC, Deboleena Roy, Karen Barad, Kim Tallbear, thank you for your unconditional support and mentorship, and for seeing me. Colleagues from the Center for Teaching and Learning: Kendra Dority, Nandini Bhattacharya, the Pedagogy Fellows Cohort. I am grateful to have served undergraduates on this campus with you. It was a pleasure working with all of you. I want to acknowledge Sandra Dreisbach and David Bernick and Zia Isola in this light as well. To my friends, Sabrina Shirazi, Katie Moon, Chloe Orland, Landen Gozashti, and the many people I met through COLA organizing and the Queer community, thank you for bringing Life into every day of my existence. Thank you for holding space for me to cry, supporting me through hard times, and reminding me how expansive life can be. ix General Introduction As DNA sequencing data becomes more prevalent, fields like population genetics and evolutionary genomics have become increasingly more data driven, requiring development of tools to work with large datasets. Here, I present my work to develop computational tools that harness big data science to extract biological meaning and knowledge in the fields of population genetics and evolutionary genomics. Indeed, there is a growing appreciation for the evolutionary forces that affect population structure and genomics across the tree of life. Two areas of study in this dissertation have long been heralded as fundamental forces of evolution: admixture and symbiotic associations. The heritability of DNA makes DNA a catalogue of historical information and the study of DNA has revealed demographic histories of plant and animal populations. One important evolutionary process is admixture, wherein genetically divergent populations encounter each other and hybridize. If an admixture event has occurred relatively recently, we can use local ancestry inference methods to trace the ancestry of discrete genomic segments back to the ancestral populations from which they are derived (Pool and Nielsen 2009; Price et al. 2009; Sankararaman et al. 2012; Maples et al. 2013; Corbett-Detig and Nielsen 2017). Due to ongoing recombination in an admixed population, the lengths of ancestry tracts from an ancestral population are expected to be inversely related to the timing of admixture. Therefore, it is 1 possible to estimate the timing of admixture by inferring ancestry tract lengths (Pool and Nielsen 2009; Gravel 2012). In Chapter 1, I present an extension and application of Ancestry_HMM, a Hidden Markov Model method that can be used for local ancestry inference and estimate the timing of ancient admixture pulses. Previously, this software was able to detect a single admixture pulse. Now, Ancestry_HMM is able to detect multiple admixture pulses from multiple source populations. Our method uses read pile-up data and does not require a priori local ancestry inference. I applied our method to real DNA sequencing data from Drosophila melanogaster, the fruit fly, and found evidence for