WRIGHT FISHER MODEL AND COALESCENT PROCESS

by

Hailong Cui and Wangshu Zhang

Supervisor: Prof. Quentin Berger

A Final Project Report Presented In Partial Fulfillment of the Requirements for Math 505b

April 2014

Acknowledgments

We want to thank Prof. Quentin Berger for introducing the Wright Fisher model to us in lecture, which inspired us to choose population genetics as our project topic. The resources Prof. Berger provided have been excellent learning materials, and his feedback has helped us greatly in creating this report. We would also like to acknowledge that the research papers in the reference list were an integral part of this process. They motivated us to learn about models beyond the class, and gave us confidence that these probabilistic models can be used in real applications.

Contents

Acknowledgments

Abstract

1 Introduction to Population Genetics

2 Wright Fisher Model
2.1 Random drift
2.2 Genealogy of the Wright Fisher model

3 Coalescent Process
3.1 Kingman’s Approximation

4 Applications

Reference List

Abstract

In this project on population genetics, we introduce models that lay the foundation for studying more complicated models. Specifically, we discuss the Wright Fisher model and the coalescent process. These models interest us not only for their mathematical elegance: with the availability of massive amounts of sequencing data, we can use them (or more advanced models incorporating variable population size and other effects, which are outside the scope of this project) to answer real questions in molecular biology. First we explain concepts such as random drift, then discuss whether an allele can eventually become fixed in a population and what the probability is that genetic variation survives after a given number of generations. After this we illustrate with graphs the tree-like nature of tracing back to the most recent common ancestor (MRCA), then derive the distribution of the time back to the MRCA for a sample of size 2. In the remainder of the report, we provide a treatment of Kingman’s approximation. Finally we move on to a literature review of an application to HIV-1: coalescent estimates of the HIV-1 generation time in vivo.

Keywords: Population Genetics, Wright Fisher model, Most Recent Common Ancestors (MRCA), Allele, Sequencing data, Heterozygosity, Genealogy, Coalescent process, Kingman’s approximation, HIV-1

Chapter 1

Introduction to Population Genetics

With the advent of new sequencing technology [1, 2], we are harvesting large volumes of genetic data (at the DNA, RNA and even protein level) and making them publicly available [3, 4, 5, 6]. This enables researchers to analyze these sequencing data to tackle one of the most important challenges in modern molecular biology: how to make sense of the variation in genetic information and how this variation translates into differences in phenotypes. For example, can one capture the evolution among tumor cells and use the observed variability to infer the rate of disease progression? Another example is to study the variation among human genetic sequences to identify genes that are related to diseases, such as the Egr3 gene for schizophrenia. These are some of the questions in population genetics. Within the scope of this project, we first introduce probabilistic models that form the basis of further research. We then briefly review some published papers that use these models, as well as other related methods and software in phylogeny, all of which ultimately aim to understand diseases better.

Essentially, the field of population genetics is the study of genetic variation within a population. We assume that a gene has two alleles and denote them by A and B. The population is then composed of individuals with two copies of each gene, i.e., AA, AB (BA) or BB. It is convenient to classify evolutionary problems by the time scales involved. A typical question asks what will happen in the future, such as “how long does a new mutant survive in the population?” or “what is the chance that an allele gets fixed in a population?”. We can also think about these problems from a different angle, retrospectively, by asking where the population has been in the past. Many factors can affect the evolution of a population, such as random drift, selection, mutation, recombination, population subdivision, etc. Nonetheless, in the next section we begin by introducing a simple model in which many of these effects are ignored.

Chapter 2

Wright Fisher Model

In this section we begin by introducing the simplest Wright Fisher model (Fisher (1922), Wright (1931)). Here we assume the population is finite and of constant size, and that there are only two alleles at the locus of interest. We also ignore the effects of mutation, selection, etc., and assume that the population undergoes random mating. The resulting random fluctuation of allele frequencies from one generation to the next is what we call random drift, which will be discussed more formally below.

2.1 Random drift

Let’s denote the two alleles at the locus of interest by A and B as before, and assume no mutation occurs. Define $Y_r$ as the number of A alleles in generation $r$; then $N - Y_r$ is the number of B alleles in generation $r$. First, let’s make the following assumptions:

Assumption 1. Discrete, non-overlapping generations of equal size N.

Assumption 2. The parents of the next generation of N genes are picked randomly with replacement from the preceding generation (genetic differences have no fitness consequences).

The population at generation $r + 1$ is derived from the population at generation $r$ by binomial sampling of $N$ genes from a gene pool in which the fraction of A alleles is its current frequency, namely $\pi_i = i/N$. Hence given $Y_r = i$, the probability that $Y_{r+1} = j$ is

$$p_{ij} = \binom{N}{j} \pi_i^{j} (1 - \pi_i)^{N-j}, \qquad 0 \le i, j \le N.$$

The process $\{Y_r, r = 0, 1, \dots\}$ is a time-homogeneous Markov chain with transition matrix $P = (p_{ij})$ and state space $S = \{0, 1, \dots, N\}$. The states 0 and $N$ are clearly absorbing: if the population contains only one allele in some generation, then it remains so in every subsequent generation. In this case, we say that the population is fixed for that allele.
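As a minimal sketch of the transition matrix above, the following Python code builds $P = (p_{ij})$ for a small population and prints it, so that the absorbing rows 0 and $N$ can be inspected directly. The function name `transition_matrix` and the population size N = 4 are illustrative assumptions, not part of the original analysis.

```python
from math import comb

def transition_matrix(N):
    """Return the (N+1) x (N+1) Wright Fisher matrix P = (p_ij) as nested lists.

    p_ij = C(N, j) * pi_i**j * (1 - pi_i)**(N - j), with pi_i = i / N.
    """
    P = []
    for i in range(N + 1):
        pi_i = i / N
        row = [comb(N, j) * pi_i**j * (1 - pi_i)**(N - j) for j in range(N + 1)]
        P.append(row)
    return P

if __name__ == "__main__":
    N = 4  # illustrative population size (assumption)
    P = transition_matrix(N)
    for i, row in enumerate(P):
        # Rows 0 and N put all their mass on the diagonal (absorbing states).
        print(i, [round(p, 4) for p in row])
```

Each row is a Binomial($N$, $i/N$) distribution, so it sums to one and has mean $N \pi_i = i$.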

The binomial nature of the transition matrix makes some properties of the process easy to calculate. For example,

$$E(Y_r \mid Y_{r-1}) = N \frac{Y_{r-1}}{N} = Y_{r-1},$$

so by taking expectations on both sides we get $E(Y_r) = E(Y_{r-1})$, and by iteration,

$$E(Y_r) = E(Y_0), \qquad r = 1, 2, \dots$$

Note that

$$E(Y_{r+1} \mid Y_r = i) = N \left(\frac{i}{N}\right) = i,$$

$$\mathrm{Var}(Y_{r+1} \mid Y_r = i) = N \left(\frac{i}{N}\right) \left(1 - \frac{i}{N}\right).$$

Therefore the expected number of A (or B) alleles remains constant across generations; nonetheless, variability must eventually be lost. Hence the population will ultimately contain only A alleles or only B alleles, since states 0 and $N$ are absorbing. Naturally we want to understand the probabilities of these events, so we define

$$a_i = P(\text{eventually all alleles are A} \mid \text{initially only } i \text{ alleles are A}).$$

Clearly $a_0 = 0$ and $a_N = 1$, and $Y_r$ is a martingale, as can be seen from the conditional expectation above. If we define $T$ as the time of absorption at 0 or $N$ and apply the optional stopping theorem, we have $E[Y_T] = E[Y_0] = i$. Since $Y_T$ takes only the values 0 and $N$, we get

$$E[Y_T] = N \cdot P(Y_T = N) = N \cdot a_i = i,$$

so

$$a_i = \frac{i}{N}.$$

This means an allele will eventually become fixed in the population with probability equal to its initial proportion. As a side note, fixation makes it harder to trace genetic sequences back in time to determine common ancestors.

The next question of interest is how fast the genetic variation is lost. To this end, let’s study another widely used term in population genetics: heterozygosity. It is defined as the probability $H_r$ that two genes chosen at random with replacement in generation $r$ are different.

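If we define $P_r = Y_r / N$ to be the proportion of A alleles in generation $r$, then the heterozygosity is $H_r = 2 P_r (1 - P_r)$. Consider the expected heterozygosity after one generation, given $P_0 = p_0$:

$$\begin{aligned}
E(H_1) &= E(2 P_1 (1 - P_1)) \\
&= 2 \left( E(P_1) - E(P_1^2) \right) \\
&= 2 \left( E(P_1) - E(P_1)^2 - \mathrm{Var}(P_1) \right) \\
&= 2 \left( p_0 - p_0^2 - \frac{p_0 (1 - p_0)}{N} \right) \\
&= 2 p_0 (1 - p_0) \left( 1 - \frac{1}{N} \right) \\
&= H_0 \left( 1 - \frac{1}{N} \right).
\end{aligned}$$

After $r$ generations:

$$E(H_r) = H_0 \left( 1 - \frac{1}{N} \right)^r \approx H_0 e^{-r/N}.$$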

The probability $H_r$ measures the genetic variability surviving in the population, which decays at rate $1/N$ per generation. The decrease of heterozygosity is a measure of random drift. As the computation above shows, the expected heterozygosity decays to 0 as $r$ goes to infinity. The expected time until variability is lost is complicated to compute. Because explicit expressions are difficult to find, one may resort to approximation methods. Interested readers can refer to the literature on diffusion approximations for further reading [7].
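The two main results of this section, the fixation probability $a_i = i/N$ and the decay $E(H_r) = H_0 (1 - 1/N)^r$, can be checked with a small Monte Carlo experiment. The sketch below is only an illustration under assumed parameters (N = 20, 6 initial A alleles, r = 5, 4000 replicates, a fixed seed); the function names are hypothetical and the estimates carry sampling noise.

```python
import random

def next_generation(y, N, rng):
    """One Wright Fisher step: each of the N genes has an A parent with prob. y/N."""
    p = y / N
    return sum(1 for _ in range(N) if rng.random() < p)

def run_replicate(N, y0, r, rng):
    """Run one population until fixation (and for at least r >= 1 generations).

    Returns (H_r, fixed_for_A), where H_r = 2 P_r (1 - P_r).
    """
    y, gen, h_r = y0, 0, None
    while 0 < y < N or gen < r:
        y = next_generation(y, N, rng)
        gen += 1
        if gen == r:
            h_r = 2 * (y / N) * (1 - y / N)
    return h_r, y == N

def estimate(N=20, y0=6, r=5, replicates=4000, seed=1):
    """Monte Carlo estimates of the fixation probability a_i and of E(H_r)."""
    rng = random.Random(seed)
    results = [run_replicate(N, y0, r, rng) for _ in range(replicates)]
    fix_prob = sum(fixed for _, fixed in results) / replicates
    mean_h = sum(h for h, _ in results) / replicates
    return fix_prob, mean_h

if __name__ == "__main__":
    N, y0, r = 20, 6, 5  # illustrative values (assumptions)
    fix_prob, mean_h = estimate(N, y0, r)
    p0 = y0 / N
    print("fixation of A: simulated", fix_prob, "theory", p0)
    print("E(H_r): simulated", mean_h,
          "theory", 2 * p0 * (1 - p0) * (1 - 1 / N) ** r)
```

The simulated values should agree with the theoretical ones up to Monte Carlo error, which shrinks as the number of replicates grows.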

2.2 Genealogy of the Wright Fisher model

In this section we study the genealogy of the Wright Fisher model. We can imagine that each individual in a given generation carries either an A or a B allele. Assuming no mutation as before, all offspring of A individuals continue to carry only A alleles. Below we introduce the concept of the most recent common ancestor (MRCA) by illustrating two simulation results in Fig 2.1 and Fig 2.2 [7]. Both are for a Wright Fisher model of $N = 9$ individuals. Generations evolve vertically downward and the individuals are labelled 1, 2, ..., 9 from left to right. Lines are directional, though drawn without arrows, and join individuals in two consecutive generations if one is the offspring of the other.

Figure 2.1: First simulation.

In Fig 2.1, we can see that individuals 3 and 4 have their MRCA 3 generations ago. This figure shows a rather tangled set of relationships and may look confusing. Fig 2.2, however, presents a clearer structure in a typical phylogenetic tree format. The order of the individuals is untangled in Fig 2.2, and we can see that the MRCA of individuals 6 and 7 is 11 generations ago, i.e., the root of the tree. Now we want to understand how long it takes for two alleles to trace back to their MRCA. Since individuals choose their parents at random, we see that

$$P(\text{2 individuals have 2 distinct parents}) = \lambda = 1 - \frac{1}{N}.$$

Figure 2.2: Second simulation in untangled form.

Since those parents are themselves a random sample from their generation, we may iterate this argument to see that

$$P(\text{first common ancestor more than } r \text{ generations ago}) = \lambda^r = \left( 1 - \frac{1}{N} \right)^r. \qquad (2.1)$$

When the population size is large and time is measured in units of $N$ generations, the time to the MRCA of a sample of size 2 has approximately an exponential distribution with mean 1. To see this, rescale time so that $r = Nt$, and let $N \to \infty$ in (2.1). We see that this probability is

$$\left( 1 - \frac{1}{N} \right)^{Nt} \to e^{-t}.$$
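The limit above can be illustrated numerically. The following sketch evaluates (2.1) with $r = \lfloor Nt \rfloor$ for increasing $N$ and prints it next to $e^{-t}$; the function name `prob_no_common_ancestor` and the value t = 1.5 are assumptions chosen only for illustration.

```python
import math

def prob_no_common_ancestor(N, t):
    """Equation (2.1) with r = floor(N t): chance that two lineages stay distinct."""
    r = math.floor(N * t)
    return (1 - 1 / N) ** r

if __name__ == "__main__":
    t = 1.5  # time in units of N generations (illustrative assumption)
    for N in (10, 100, 1000, 10000):
        # The discrete probability approaches exp(-t) as N grows.
        print(N, prob_no_common_ancestor(N, t), math.exp(-t))
```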

Now we consider the probability $h_r$ that two individuals chosen with replacement from generation $r$ carry distinct alleles. The two draws are distinct individuals with probability $(N-1)/N$. Two distinct individuals carry different alleles if and only if their common ancestor is more than $r$ generations ago and their ancestors at time 0 carry different alleles. The probability of the latter event is the chance that 2 individuals chosen without replacement at time 0 carry different alleles, which is just $E[2 Y_0 (N - Y_0)] / (N(N-1))$. Combining these results,

$$h_r = \frac{N-1}{N} \, \lambda^r \, \frac{E[2 Y_0 (N - Y_0)]}{N(N-1)} = \lambda^r h_0,$$

where $h_0 = E[2 Y_0 (N - Y_0)] / N^2$, just as for the $H_r$ we discussed in the previous section. Here are more discrete-time properties:

• P(two genes have the same parent in the previous generation) is $\frac{1}{N}$.

• Number of generations since two genes first shared a common ancestor $\sim \text{Geometric}\left( \frac{1}{N} \right)$.

• Number of generations since at least two genes in a sample of $k$ shared a common ancestor $\sim \text{Geometric}\left( \frac{k(k-1)}{2N} \right)$, approximately.

Proof. Define $G_{k,k}$ to be the probability that $k$ genes have $k$ distinct ancestors in the previous generation. Then

$$\begin{aligned}
G_{k,k} &= \left( \frac{N-1}{N} \right) \left( \frac{N-2}{N} \right) \cdots \left( \frac{N-(k-1)}{N} \right) \\
&= \left( 1 - \frac{1}{N} \right) \left( 1 - \frac{2}{N} \right) \cdots \left( 1 - \frac{k-1}{N} \right) \\
&= 1 - \frac{1 + 2 + 3 + \cdots + (k-1)}{N} + O\left( \frac{1}{N^2} \right) \\
&= 1 - \frac{k(k-1)}{2N} + O\left( \frac{1}{N^2} \right).
\end{aligned}$$

Therefore, the probability that at least two genes share a common ancestor in the previous generation is

$$1 - G_{k,k} = \frac{k(k-1)}{2N} + O\left( \frac{1}{N^2} \right).$$

Since this is the same in each generation, the number of generations until at least two genes in a sample of $k$ shared a common ancestor is approximately $\text{Geometric}\left( \frac{k(k-1)}{2N} \right)$.
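To see the size of the $O(1/N^2)$ term in the proof, one can compare the exact product for $G_{k,k}$ with its first-order approximation. The sketch below uses exact rational arithmetic; the function names and the choice k = 5 are illustrative assumptions.

```python
from fractions import Fraction

def g_kk_exact(k, N):
    """Exact probability that k genes have k distinct parents among N."""
    prob = Fraction(1)
    for m in range(1, k):
        prob *= Fraction(N - m, N)
    return prob

def g_kk_approx(k, N):
    """First-order approximation 1 - k(k-1)/(2N)."""
    return 1 - Fraction(k * (k - 1), 2 * N)

if __name__ == "__main__":
    k = 5  # illustrative sample size (assumption)
    for N in (100, 1000, 10000):
        err = g_kk_exact(k, N) - g_kk_approx(k, N)
        # If the error is O(1/N^2), err * N^2 should stay roughly constant.
        print(N, float(err), float(err * N * N))
```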

Chapter 3

Coalescent Process

In this section we discuss a basic coalescent process, which is closely related to the MRCA introduced in the previous sections. The term coalescence means “connection” or “coming together”; it is the opposite of branching. When two alleles are descended from a common ancestor in some previous generation, we say that they coalesce in that generation. In the Wright Fisher model above we started from a population of size $N$ and moved “forward” in time to observe descendants. In the coalescent process, we begin from a certain generation and look “backward” in time. Viewed this way, the lineages of two individuals of interest merge in some previous generation.

Let’s begin with the simplest statement of the coalescent model. Kingman proved it to be the limiting ancestral process for a broad class of population structures that includes the Wright Fisher model. We trace the ancestral lineages, which are the series of genetic ancestors of the samples at a locus, back through time. The history of a sample of size $n$ comprises $n - 1$ coalescent events. Each coalescent event decreases the number of ancestral lineages by one. This takes the sample from the present day, when there are $n$ lineages, through a series of steps in which the number of lineages decreases from $n$ to $n - 1$, then from $n - 1$ to $n - 2$, etc., and finally from two to one. The single lineage remaining at the final coalescent event is the MRCA of the entire sample. At each coalescent event, two of the lineages fuse into one common ancestral lineage. The result is a bifurcating tree like the one shown in Fig 3.1. The times $T_i$ on the right in Fig 3.1 are the times during which there were exactly $i$ lineages ancestral to the sample.

3.1 Kingman’s Approximation

Discrete-time models can be cumbersome to work with, so we would like a representation in continuous time. Kingman (1982) considered the case where $N$ (population size) is very large relative to $n$ (sample size). Recall that $G_{k,k}$ = probability that $k$ genes had $k$ ancestors in the previous generation. Define $G_{i,j}$ = probability that $i$ genes had $j$ ($j < i$) ancestors in the previous generation.

Figure 3.1: A coalescent genealogy of a sample of $n = 9$ items.

• $P(T_i > t)$ = probability that the time to a coalescent event in a sample of $i$ lineages from a population of size $N$ is greater than $t$, where $[Nt]$ denotes the integer part of $Nt$, so that $t$ is measured in units of $N$ generations:

$$P(T_i > t) = (G_{i,i})^{[Nt]}$$

For the Wright Fisher model:

$$P(T_i > t) = (G_{i,i})^{[Nt]} = \left( 1 - \frac{i(i-1)}{2N} + O\left( \frac{1}{N^2} \right) \right)^{[Nt]} \to e^{-\binom{i}{2} t} \quad \text{as } N \to \infty.$$

In this case, with appropriate time units, the time to coalescence in a sample of $i$ lineages follows an exponential distribution with mean $\mu = \frac{2}{i(i-1)}$.

The probability density function of $T_i$ is

$$f_{T_i}(t) = \binom{i}{2} e^{-\binom{i}{2} t}, \qquad t \ge 0, \quad i = 2, 3, \dots, n.$$

The mean and variance are

$$E(T_i) = \frac{2}{i(i-1)}, \qquad \mathrm{Var}(T_i) = \left( \frac{2}{i(i-1)} \right)^2.$$

Fewer lineages mean a longer expected time to coalescence. To generate a genealogy of $i$ genes under Kingman’s coalescent:

• Draw an observation from an exponential distribution with mean $\mu = 2/(i(i-1))$. This will be the time of the first coalescent event (looking from the present backwards in time).

• Pick two lineages at random to coalesce.

• Decrease $i$ by 1.

• If $i = 1$, stop. Otherwise, repeat these steps [8, 9].
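The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `simulate_coalescent`, the representation of lineages as tuples of sample labels, and the fixed seed are all assumptions made for the example. Time is measured in the rescaled units of the text.

```python
import random

def simulate_coalescent(n, rng):
    """Simulate a genealogy of n genes under Kingman's coalescent.

    Returns a list of (time, merged_pair) events, where time is the total
    time back from the present and merged_pair holds the two lineages that
    coalesce. A lineage is a tuple of the sampled genes that descend from it.
    """
    lineages = [(g,) for g in range(n)]
    events, time = [], 0.0
    i = n
    while i > 1:
        rate = i * (i - 1) / 2           # exponential rate = 1 / mean
        time += rng.expovariate(rate)    # waiting time T_i with i lineages
        a, b = rng.sample(range(i), 2)   # two lineages chosen at random
        pair = (lineages[a], lineages[b])
        merged = lineages[a] + lineages[b]
        lineages = [x for idx, x in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
        events.append((time, pair))
        i -= 1                           # one fewer ancestral lineage
    return events

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility (assumption)
    for t, (left, right) in simulate_coalescent(5, rng):
        print(round(t, 3), left, "+", right)
```

A sample of size $n$ always produces $n - 1$ events, and the last event joins the two lineages that together contain the whole sample, i.e., it is the MRCA.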

Chapter 4

Applications

In this section we start with a review of a paper on coalescent estimates of HIV-1 generation time in vivo [10]. Though somewhat dated, this paper shows how a new method based on coalescent theory can be used to estimate the HIV-1 generation time in vivo. The generation time of HIV-1 had been of interest to many researchers and had previously been estimated by a different mathematical model of viral dynamics. The first author, Allen Rodrigo (now a professor at Duke), used sequencing data for the analysis, together with a reconstructed genealogy of sequences obtained over time.

The study was on a single individual, a homosexual Caucasian male who was diagnosed as HIV-1 positive following an episode of aseptic meningitis in February of 1985, when he was 23 years old. Over the course of 3 years beginning in 1989, blood was obtained at time points 7, 22, 23, and 34 months after the first specimen. The method was applied to sequences obtained from this long-term nonprogressing individual at these five sampling occasions. The average viral generation time estimated with the coalescent method was 1.2 days per generation, which is close to that obtained by mathematical modeling (1.8 days per generation), thus strengthening confidence in estimates of a short viral generation time.

Readers interested in more recent papers with applications to sequence data can refer to the 2002 Nature Reviews Genetics paper by Noah Rosenberg (now a professor at Stanford) and Magnus Nordborg [11]. The authors discussed the increased use of genetic polymorphism for inference about population phenomena, such as migration and selection, and employed the coalescent process for their analysis. Beyond the scope of our stochastic modeling, there are also different approaches to inferring phylogenetic trees, estimating the rates of molecular evolution, etc. Readers can refer to a well-established software package called MEGA [12].

Reference List

[1] E. R. Mardis, “Next-generation sequencing methods,” Annu. Rev. Genomics Hum. Genet., vol. 9, pp. 387–402, 2008.

[2] M. L. Metzker, “Sequencing technologies – the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.

[3] P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice, “The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants,” Nucleic Acids Research, vol. 38, no. 6, pp. 1767–1771, 2010.

[4] D. Karolchik, R. Baertsch, M. Diekhans, T. S. Furey, A. Hinrichs, Y. Lu, K. M. Roskin, M. Schwartz, C. W. Sugnet, D. J. Thomas, et al., “The UCSC genome browser database,” Nucleic Acids Research, vol. 31, no. 1, pp. 51–54, 2003.

[5] T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, et al., “The Ensembl genome database project,” Nucleic Acids Research, vol. 30, no. 1, pp. 38–41, 2002.

[6] K. D. Pruitt, T. Tatusova, and D. R. Maglott, “NCBI reference sequence project: update and current status,” Nucleic Acids Research, vol. 31, no. 1, pp. 34–37, 2003.

[7] S. Tavaré, “Part I: Ancestral inference in population genetics,” in Lectures on Probability Theory and Statistics, pp. 1–188, Springer, 2004.

[8] J. Wakeley, “Chapter 3 of Coalescent Theory: An Introduction.” http://webpages.uidaho.edu/hohenlohe/Wakeley_ch3.pdf, cited April 2014.

[9] L. Kubatko, “Tutorial on coalescent theory.” http://www.stat.osu.edu/~lkubatko/coalescent_theory_penn_state_part1.pdf, cited April 2014.

[10] A. G. Rodrigo, E. G. Shpaer, E. L. Delwart, A. K. Iversen, M. V. Gallo, J. Brojatsch, M. S. Hirsch, B. D. Walker, and J. I. Mullins, “Coalescent estimates of HIV-1 generation time in vivo,” Proceedings of the National Academy of Sciences, vol. 96, no. 5, pp. 2187–2191, 1999.

[11] N. A. Rosenberg and M. Nordborg, “Genealogical trees, coalescent theory and the analysis of genetic polymorphisms,” Nature Reviews Genetics, vol. 3, no. 5, pp. 380–390, 2002.

[12] K. Tamura, G. Stecher, D. Peterson, A. Filipski, and S. Kumar, “MEGA6: Molecular evolutionary genetics analysis version 6.0,” Molecular Biology and Evolution, vol. 30, no. 12, pp. 2725–2729, 2013.