Inferring the Characteristics of Ancient Populations Using Bioinformatic Analysis of Genome-Wide DNA Sequencing Data
Total Page:16
File Type:pdf, Size:1020Kb
Inferring the Characteristics of Ancient Populations using Bioinformatic Analysis of Genome-wide DNA Sequencing Data Graham Gower Australian Centre for Ancient DNA School of Biological Sciences Faculty of Sciences University of Adelaide Thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy February 2019 Contents 1 Introduction1 1.1 Evolution and population dynamics................ 2 1.2 Inference based on modern data.................. 3 1.2.1 Detecting gene flow..................... 3 1.2.2 Coalescent theory and population size inference..... 4 1.3 Direct observation using ancient DNA .............. 6 1.3.1 What is ancient DNA?................... 6 1.3.2 Challenges for ancient DNA studies............ 7 1.3.3 Ancient DNA analyses................... 8 1.4 Thesis overview........................... 9 1.4.1 Motivation & Aims..................... 9 1.4.2 Early cave art and ancient DNA record the origin of European bison....................... 10 1.4.3 Population size history from short scaffolds: how short is too short?......................... 10 1.4.4 Widespread male sex bias in mammal fossil and museum collections.......................... 11 1.4.5 PP5mC: preprocessing hairpin-ligated bisulfite-treated DNA sequences....................... 12 1.5 References.............................. 12 2 Early cave art and ancient DNA record the origin of European bison 25 2.1 Authorship statement........................ 25 2.2 Manuscript ............................. 34 2.3 Supplementary Information .................... 42 2.4 Supplementary Data 1: sample details . 112 3 Population size history from short scaffolds: how short is too short? 125 3.1 Authorship statement........................125 3.2 Manuscript .............................128 4 Widespread male sex bias in mammal fossil and museum collections 143 4.1 Authorship statement........................143 4.2 Manuscript .............................146 4.3 Supplementary Information ....................167 5 PP5mC: preprocessing hairpin-ligated bisulfite-treated DNA sequences 175 5.1 Authorship statement........................175 5.2 Manuscript .............................177 5.3 Supplementary Figures.......................192 6 Discussion 197 6.1 Research summary .........................198 6.2 Primary outcomes..........................198 6.2.1 European bison ancestry..................198 6.2.2 PSMC and MSMC with short scaffolds . 199 6.2.3 A male bias is ubiquitous in mammal collections . 199 6.2.4 More efficient processing of HBS-seq data . 200 6.3 Synthesis...............................201 6.4 Limitations and future directions . 202 6.4.1 Models for the origin of European bison . 202 6.4.2 SMC-based population size inference . 202 6.4.3 Drivers of male-biased sex ratios . 204 6.4.4 Prospects for paleo-epigenetics . 205 6.4.5 PP5mC computational performance . 205 6.5 Concluding remarks.........................206 6.6 References..............................206 Abstract In this thesis, I apply, evaluate, and develop methods for learning about past populations from genome-wide sequencing data. Specifically, I: apply methods based on random genetic drift between populations, to • determine that pre-Holocene gene flow occurred between the ancestors of domestic cattle (Bos primigenius) and European bison (Bison priscus), and that the contribution of Bos genealogy to the bison lineage was less than 10 %; use simulations to assess the impact of short genomic scaffolds when • inferring past populations sizes with the pairwise sequentially Markov coalescent, and show that population size inferences can be robust for scaffold lengths as short as 100 kb; perform genetic sex determination of ancient DNA specimens to show • that bison (Bison spp.) and brown bear (Ursus arctos) specimens are approximately 75 % male, and that male-biased observations likely stem from the ecological and social structures of the populations; and develop a suite of software tools for processing hairpin bisulfite • sequencing data, which can be used to investigate genome-wide DNA methylation levels in ancient DNA. Acknowledgements This thesis would not have been possible without the ongoing support of my su- pervisors Alan Cooper, Bastien Llamas, Julien Soubrier, and Kieren Mitchell. You provided me with an opportunity, and the freedom, to learn a great deal about a broad range of interesting biological phenomena. Thank you for your time, energy, criticism, and many motivating ideas. I also wish to thank all the members of ACAD, for enjoyable journal club discussions, social events, and sporadic tomfoolery. And speaking of tomfoolery, it’s been a pleasure to col- laborate with our friends in the School of Mathematical Sciences—Nigel, Jono, and Ben, you’ve inspired me to spend my leisure time reading maths books. Finally, to my partner Saskia; thank you for being there for me no matter what, and thank you for listening to my long winded and overly-detailed ex- planations with far too many digressions before sometimes finally getting to the point. Chapter 1 Introduction 1 2 CHAPTER 1. INTRODUCTION 1.1 Evolution and population dynamics Evolutionary biologists wish to learn about organisms living now, and those that lived in the past. Motivations are diverse and include understanding evolutionary processes, conservation of species, or characterising the etiology of diseases. How are living things related to each other? How did their form and function come to be? What causes them to go extinct? What processes that are occurring now, have also occurred in the past? Can the past and the present tell us about the future? Such ideas have a long history in biology, starting with Darwin, who was immensely influenced by the uniformitarianism of Lyell(1835). By comparing living species with those that are extinct, we can derive hypotheses about the circumstances for extinction, and why similar species did not suffer the same fate. Using what we learn of the past we can try to predict, and hopefully pacify, current and future extinction risk. Population parameters such as spatial distribution, age-specific mortality, sex ratios, migration rates between populations, and population size can all be informative regarding behavioural and ecological characteristics. Changes to these population parameters over time can further pinpoint responses to the environment, with regard to large-scale disruptions from geological and climatic events. Demographic parameters of past populations have often been discussed by comparing the morphology of extant species, in the context of morphological assessments of fossils (Robson & Wood, 2008). This is made possible by a wealth of reference material for hominid remains, for example, to morphologically determine a specimen’s sex (Frayer & Wolpoff, 1985; Rehg & Leigh, 1999), and age at death (Dean & Liversidge, 2015; Dean, 2016). Increasingly, genetic data are being used in addition to, or in place of, mor- phological data. This is particularly important where remains are fragmentary or rare, such as for Denisovans, a hominin group originally described from a finger bone and a tooth (Reich et al., 2010). But genetics also offers to answer questions that were previously difficult or impossible to answer, such as de- termining average generation times (Moorjani et al., 2016), inferring ancestral population sizes (Li & Durbin, 2011), quantifying past gene flow (Patterson et al., 2012), or clarifying that distinct morphological forms are different sexes of a single species (Bunce et al., 2003; Huynen et al., 2003; Bover et al., 2018). Like for morphological data, two complementary approaches exist for investig- ating past populations with genetic data: use the data from modern individuals to infer things about their ancestors; or observe the populations directly from the DNA of subfossil remains. 1.2. INFERENCE BASED ON MODERN DATA 3 1.2 Inference based on modern data 1.2.1 Detecting gene flow The ancestors of modern populations leave extensive signatures in the genomes of their descendents. Phylogenetic relationships can be used to infer population split times (Bouckaert et al., 2014; Molak et al., 2013) and evolutionary rates of change (Bouckaert et al., 2014; Rabosky, 2014). But relationships between in- dividuals within a population are rarely tree-like, and similarly, speciation may not be characterised by clean separation into reciprocally monophyletic groups (Mallet, 2005). To detect gene flow in past populations, a four-population test statistic (F4) has been developed (Reich et al., 2009; Patterson et al., 2012), which is sensitive to even small quantities of gene flow. The principal idea of the test is that in a phylogeny, the components of random genetic drift along two distinct branches are uncorrelated, whereas if gene flow existed between the two branches then genetic drift will be correlated. A B C D Figure 1.1: A four taxon phylogenetic tree with branches representing random genetic drift. The expected difference in allele frequencies between and ( and ) is due to drift occurring along the path shown in blue (red).A B C D Consider the four populations ; ; and in Figure 1.1, and suppose we wish to test for gene flow betweenA B C and D (or and ) that occurred after and split. If a, b, c, and d areA the respectiveC B alleleC frequencies in A B populations ; ; ; and , then Patterson’s F4 = cov(a b; c d). When the four populationsA B C haveD a strictly tree-like relationship,− then− the A B drift is uncorrelated with the drift, so F4 = 0. In contrast, if gene flow has occurred between