Population Genetics: Wright Fisher Model and Coalescent Process

POPULATION GENETICS: WRIGHT FISHER MODEL AND COALESCENT PROCESS

by Hailong Cui and Wangshu Zhang

Supervisor: Prof. Quentin Berger

A Final Project Report Presented in Partial Fulfillment of the Requirements for Math 505b

April 2014

Acknowledgments

We want to thank Prof. Quentin Berger for introducing us to the Wright Fisher model in lecture, which inspired us to choose Population Genetics as our project topic. The resources Prof. Berger provided have been excellent learning materials, and his feedback has helped us greatly in creating this report. We would also like to acknowledge that the research papers in the reference list were an integral part of this process. They motivated us to learn about models beyond those covered in class, and gave us confidence that these probabilistic models can be used in real applications.

Contents

Acknowledgments
Abstract
1 Introduction to Population Genetics
2 Wright Fisher Model
2.1 Random drift
2.2 Genealogy of the Wright Fisher model
3 Coalescent Process
3.1 Kingman’s Approximation
4 Applications
Reference List

Abstract

In this project on Population Genetics, we introduce models that lay the foundation for studying more complicated models. Specifically, we discuss the Wright Fisher model and the coalescent process. These models interest us not only for their mathematical elegance: with the availability of massive amounts of sequencing data, they (or more advanced models incorporating variable population size, mutation, etc., which are beyond the scope of this project) can be used to answer real questions in molecular biology. First we explain concepts such as random drift, then discuss whether an allele eventually becomes fixed in a population and with what probability genetic variation survives over the generations.
After this we illustrate graphically the tree-like structure obtained by tracing lineages back to the most recent common ancestor (MRCA), and then derive the distribution of the time back to the MRCA for a sample of size 2. For the remainder of the report, we provide a treatment of Kingman’s approximation. Finally we move on to a literature review of an application to HIV-1, regarding coalescent estimates of the average HIV-1 generation time in vivo.

Keywords: Population Genetics, Wright Fisher model, Most Recent Common Ancestors (MRCA), Allele, Sequencing data, Heterozygosity, Genealogy, Coalescent process, Kingman’s approximation, HIV-1 evolution

Chapter 1 Introduction to Population Genetics

With the advent of new sequencing technology [1, 2], we are harvesting large volumes of genetic data (at the DNA, RNA and even protein level) and making them publicly available [3, 4, 5, 6]. This enables researchers to analyze these sequencing data to tackle one of the most important challenges in modern molecular biology: how to make sense of the variation in genetic information, and how this variation is translated into differences in phenotype. For example, can one capture the evolution among tumor cells and use the observed variability to infer the rate of disease progression? Another example is the study of variation among human genetic sequences to identify genes related to diseases, such as the Egr3 gene for schizophrenia. These are some of the questions in population genetics. Within the scope of this project, we first introduce probabilistic models that form the basis of further research. We then briefly review some published papers that use these models, as well as related methods and software in phylogeny, all of which aim at a better understanding of diseases.

Essentially, population genetics is the study of genetic variation within a population. We assume that a gene has two alleles and denote them by A and B.
Then the population is composed of individuals with two copies of each gene, i.e., AA, AB (BA) or BB. It is convenient to classify evolutionary problems by the time scales involved. One type of question looks forward in time, such as “how long does a new mutant survive in the population?” or “what is the chance that an allele gets fixed in a population?”. We can also approach these problems retrospectively, by asking where the population has been in the past. Many factors can affect the evolution of a population, such as random drift, selection, mutation, recombination, population subdivision, etc. Nonetheless, in the next chapter we begin with a simple model in which many of these effects are ignored.

Chapter 2 Wright Fisher Model

In this chapter we begin by introducing the simplest Wright Fisher model (Fisher (1922), Wright (1931)). Here we assume the population is finite, of constant size, and each individual has only two alleles. We also ignore the effects of mutation, selection, etc., and assume the population undergoes random mating. Under these assumptions, allele frequencies change only through the randomness of sampling from one generation to the next. This is what we call random drift, which is discussed more formally below.

2.1 Random drift

Let us denote the two alleles at the locus of interest by A and B as before, and assume no mutation occurs. Define $Y_r$ as the number of A alleles in generation $r$; then $N - Y_r$ is the number of B alleles in generation $r$. First, we make the following assumptions:

Assumption 1. Discrete, non-overlapping generations of equal size $N$.

Assumption 2. The parents of the next generation of $N$ genes are picked randomly with replacement from the preceding generation (genetic differences have no fitness consequences).

The population at generation $r + 1$ is derived from the population at generation $r$ by binomial sampling of $N$ genes from a gene pool in which the fraction of A alleles is its current frequency, namely $\pi_i = i/N$.
Hence, given $Y_r = i$, the probability that $Y_{r+1} = j$ is

\[ p_{ij} = \binom{N}{j} \pi_i^{j} (1 - \pi_i)^{N-j}, \qquad 0 \le i, j \le N, \]

where $\pi_i = i/N$. The process $\{Y_r, r = 0, 1, \dots\}$ is a time-homogeneous Markov chain with transition matrix $P = (p_{ij})$ and state space $S = \{0, 1, \dots, N\}$. The states 0 and $N$ are clearly absorbing: if the population contains only one allele in some generation, then it remains so in every subsequent generation. In this case, we say that the population is fixed for that allele.

The binomial nature of the transition matrix makes some properties of the process easy to calculate. For example,

\[ E(Y_r \mid Y_{r-1}) = N \, \frac{Y_{r-1}}{N} = Y_{r-1}, \]

so by taking expectations on both sides we get $E(Y_r) = E(Y_{r-1})$, and by iteration,

\[ E(Y_r) = E(Y_0), \qquad r = 1, 2, \dots \]

Note that

\[ E(Y_{r+1} \mid Y_r = i) = N \left( \frac{i}{N} \right) = i, \]

\[ \mathrm{Var}(Y_{r+1} \mid Y_r = i) = N \left( \frac{i}{N} \right) \left( 1 - \frac{i}{N} \right). \]

Therefore the expected number of A (or B) alleles remains constant across generations; nonetheless, variability must eventually be lost. Hence the population will ultimately contain only A alleles or only B alleles, that is, the chain is absorbed at state 0 or $N$. Naturally we want to understand the probabilities of these events, so we define

\[ a_i = P(\text{eventually all alleles are A} \mid \text{initially } i \text{ alleles are A}). \]

Clearly $a_0 = 0$ and $a_N = 1$, and $Y_r$ is a martingale, as can be seen from the conditional expectation above. Let $T$ be the time of absorption at 0 or $N$. Since $Y_r$ is bounded, the optional stopping theorem applies at $T$ and gives $E[Y_T] = E[Y_0] = i$. Because $Y_T$ takes only the values 0 and $N$,

\[ E[Y_T] = N \cdot P(Y_T = N) = N \cdot a_i = i, \]

so

\[ a_i = \frac{i}{N}. \]

This means an allele will eventually become fixed in the population with probability equal to its initial proportion. As a side note, fixation in a genetic sequence makes it harder to trace lineages back in time to determine common ancestors.

The next question of interest is how fast the genetic variation is lost. To answer it, we study another widely used term in population genetics: heterozygosity.
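Before turning to heterozygosity, the fixation result $a_i = i/N$ can be checked by simulation. The following is a minimal Python sketch and not part of the original derivation; the function names and the values $N = 20$, $i = 5$ and 2000 replicates are illustrative assumptions. It draws each generation from the binomial transition law and counts how often the chain is absorbed at $N$.

```python
import random


def next_generation(i, N, rng):
    """One Wright Fisher step: draw Y_{r+1} ~ Binomial(N, i/N) given Y_r = i."""
    p = i / N
    # Sum of N independent Bernoulli(p) trials is Binomial(N, p).
    return sum(1 for _ in range(N) if rng.random() < p)


def run_until_fixation(i, N, rng):
    """Iterate the chain from Y_0 = i until it is absorbed at 0 or N.

    Returns (absorbing state, number of generations taken).
    """
    t = 0
    while 0 < i < N:
        i = next_generation(i, N, rng)
        t += 1
    return i, t


def estimate_fixation_probability(i, N, replicates=2000, seed=0):
    """Monte Carlo estimate of a_i = P(allele A is eventually fixed | Y_0 = i)."""
    rng = random.Random(seed)
    fixed_for_A = sum(
        1 for _ in range(replicates) if run_until_fixation(i, N, rng)[0] == N
    )
    return fixed_for_A / replicates


if __name__ == "__main__":
    N, i = 20, 5  # illustrative values, not taken from the report
    print("simulated a_i:", estimate_fixation_probability(i, N))
    print("theory i/N:   ", i / N)
```

With a few thousand replicates the simulated frequency should lie close to $i/N$; the agreement is statistical, so the two printed numbers will not match exactly.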
Heterozygosity is defined as the probability $H_r$ that two genes chosen at random with replacement in generation $r$ are different. If we define $P_r = Y_r / N$ to be the proportion of A alleles in generation $r$, then the heterozygosity is

\[ H_r = 2 P_r (1 - P_r). \]

Consider the expected heterozygosity after one generation, writing $p_0$ for the initial proportion of A alleles, so that $H_0 = 2 p_0 (1 - p_0)$:

\[
\begin{aligned}
E(H_1) &= E\big( 2 P_1 (1 - P_1) \big) \\
&= 2 \big( E(P_1) - E(P_1^2) \big) \\
&= 2 \big( E(P_1) - E(P_1)^2 - \mathrm{Var}(P_1) \big) \\
&= 2 \left( p_0 - p_0^2 - \frac{p_0 (1 - p_0)}{N} \right) \\
&= 2 p_0 (1 - p_0) \left( 1 - \frac{1}{N} \right) \\
&= H_0 \left( 1 - \frac{1}{N} \right).
\end{aligned}
\]

After $r$ generations:

\[ E(H_r) = H_0 \left( 1 - \frac{1}{N} \right)^{r} \approx H_0 \, e^{-r/N}. \]

The probability $H_r$ measures the genetic variability surviving in the population, and its expectation decays at rate $1/N$ per generation. The decrease of heterozygosity is a measure of random drift. As the computation above shows, the expected heterozygosity decays to 0 as $r$ goes to infinity. The expected time until variability is lost is complicated to compute. In fact, because explicit expressions are difficult to find, one may want to resort to approximation methods. Interested readers can refer to the literature on diffusion approximations for further reading [7].

2.2 Genealogy of the Wright Fisher model

In this section we study the genealogy of the Wright Fisher model. We can imagine that each individual in a given generation carries either the A or the B allele. Assuming no mutation as before, all offspring of A individuals continue to carry only A alleles.
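The parent-sampling picture also gives a direct numerical check of the heterozygosity decay derived in Section 2.1. The Python sketch below is an illustration under stated assumptions rather than the report's own method: the function names, the population size $N = 10$, the starting count of 5 A alleles and the 5000 replicates are all hypothetical choices. Each gene in the next generation picks a parent uniformly at random with replacement and copies its allele, so the parent indices record one generation of the genealogy.

```python
import random


def step_with_parents(population, rng):
    """Build the next generation by explicit parent sampling.

    Each of the N new genes picks a parent index uniformly at random with
    replacement and inherits that parent's allele (no mutation).
    Returns (children, parents), where parents[k] is the parent of child k.
    """
    N = len(population)
    parents = [rng.randrange(N) for _ in range(N)]
    children = [population[k] for k in parents]
    return children, parents


def heterozygosity(population):
    """H = 2 P (1 - P), where P is the proportion of A alleles."""
    p = population.count("A") / len(population)
    return 2 * p * (1 - p)


def mean_heterozygosity(N, n_A, generations, replicates=5000, seed=1):
    """Average H_r over independent replicate populations, r = 0..generations."""
    rng = random.Random(seed)
    totals = [0.0] * (generations + 1)
    for _ in range(replicates):
        pop = ["A"] * n_A + ["B"] * (N - n_A)
        totals[0] += heterozygosity(pop)
        for r in range(1, generations + 1):
            pop, _ = step_with_parents(pop, rng)
            totals[r] += heterozygosity(pop)
    return [total / replicates for total in totals]


if __name__ == "__main__":
    N, n_A, generations = 10, 5, 5  # illustrative values only
    H0 = 2 * (n_A / N) * (1 - n_A / N)
    for r, simulated in enumerate(mean_heterozygosity(N, n_A, generations)):
        print(r, round(simulated, 4), round(H0 * (1 - 1 / N) ** r, 4))
```

Each printed row gives the generation, the simulated mean heterozygosity and the predicted value $H_0 (1 - 1/N)^r$. Keeping the `parents` list from each step, instead of discarding it, is what allows lineages to be traced backwards in time.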
