
University of Calgary PRISM: University of Calgary's Digital Repository

Graduate Studies The Vault: Electronic Theses and Dissertations

2019-12-23 Direct Solutions of the Wright-Fisher Model

Kryukov, Ivan

Kryukov, I. (2019). Direct Solutions of the Wright-Fisher Model (Unpublished doctoral thesis). University of Calgary, Calgary, AB. http://hdl.handle.net/1880/111434 doctoral thesis

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Direct Solutions of the Wright-Fisher Model

by

Ivan Kryukov

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN BIOCHEMISTRY AND MOLECULAR BIOLOGY

CALGARY, ALBERTA DECEMBER, 2019

© Ivan Kryukov 2019

Contents

List of figures
List of tables
Abstract
Preface
Acknowledgements

1 Introduction
  1.1 The Wright-Fisher model
    1.1.1 Transition probability matrix of the Wright-Fisher model
    1.1.2 Selection and mutation
    1.1.3 Probability of fixation
  1.2 Applications in evolutionary biology
    1.2.1 Mutation-selection limitations
  1.3 Direct computation of substitution rate

2 Wright-Fisher Exact Solver
  2.1 Introduction
  2.2 Results
    2.2.1 Implementation
    2.2.2 Evaluation
    2.2.3 Discussion
  2.3 Methods
    2.3.1 Finite absorbing Markov chain theory
    2.3.2 Rapid solution of restricted linear systems
    2.3.3 Parameterization of the Wright-Fisher model
    2.3.4 Rapid calculation of the transition matrix
  2.4 Additional Results
    2.4.1 Comparison of Exact and Approximate Results
    2.4.2 Effect of Truncation

3 Allele Age
  3.1 Introduction
  3.2 Results
    3.2.1 Validation by comparison to other methods
    3.2.2 Computational advantages of the exact approach
    3.2.3 Direct demonstration of classical results
    3.2.4 Selective strolls and stochastic slowdowns
    3.2.5 Fast mutation and age imbalance
    3.2.6 Allowing the starting number of copies to vary
  3.3 Discussion
  3.4 Materials and Methods
    3.4.1 Theory
    3.4.2 Implementation
  3.5 Simulations
  3.6 Supporting information
  3.7 Supplementary methods

4 Modelling Time-Heterogeneous Evolution and Changing Population Size
  4.1 Abstract
  4.2 Introduction
    4.2.1 Background
    4.2.2 Consideration of selection
    4.2.3 From fixation probabilities to allele frequency spectra
  4.3 Methods
    4.3.1 Time-homogeneous Wright-Fisher model
    4.3.2 Time-heterogeneous Wright-Fisher model
    4.3.3 Allele frequency spectrum calculation
  4.4 Results
    4.4.1 Fluctuating population size
    4.4.2 Increasing population size
    4.4.3 Distribution of allele frequencies
  4.5 Conclusions
  4.6 Supplement
    4.6.1 Supplementary methods
    4.6.2 AFS approximation scaling
    4.6.3 Supplementary figures

5 Rate of Substitution with Standing Genetic Variation
  5.1 Abstract
  5.2 Introduction
  5.3 Methods
    5.3.1 Rate of approach to equilibrium
    5.3.2 Rate of substitution in the Wright-Fisher model
    5.3.3 Modelling single-origin selective sweeps
    5.3.4 Modelling multiple- and single-origin selective sweeps by recurrent mutation
    5.3.5 Modelling multiple- and single-origin selective sweeps by recurrent mutation and standing genetic variation
    5.3.6 Simulations
  5.4 Results
    5.4.1 Approach to equilibrium
    5.4.2 Finite-sites substitution rate with bidirectional mutation, selection, and SGV
    5.4.3 Validation by simulation
    5.4.4 Sojourn times prior to absorption
  5.5 Conclusions
    5.5.1 Data availability
  5.6 Supplement
    5.6.1 Solving for equilibrium
    5.6.2 Approach to equilibrium via spectral decomposition
  5.7 Wright-Fisher simulation code
  5.8 Validation by simulation
  5.9 Supplementary figures

6 WFES2: New models, computations, and improved performance
  6.1 Abstract
  6.2 Introduction
    6.2.1 Available computations
  6.3 Methods and Results
    6.3.1 Time-dependent distributions
    6.3.2 Discrete phase-type distributions
  6.4 Small population size approximation and truncation
  6.5 Conclusions
  6.6 Supplement
    6.6.1 Adjustable sparsity threshold
    6.6.2 Distributions of time to absorption with selection

7 Future Directions
  7.1 Multiple Alleles
  7.2 Comparison to diffusion theory
  7.3 Coalescent

List of Figures

1.1 Change of number of copies per Wright-Fisher generation

2.1 Relative probabilities of fixation
2.2 Probabilities of fixation
2.3 A small N = 10 WF transition matrix

3.1 Distributions of allele age by simulation
3.2 Expected allele age and variance as a function of selection, dominance, and mutation rate
3.3 Expected extinction and fixation times when mutation is strong
3.4 Difference in conditional sojourn times (compared to neutral) for selected alleles going to extinction
3.5 Effect of integrating out uncertainty in p
3.6 Expected allele age and variance - larger range of mutation rates
3.7 Simulated neutral allele age distributions - larger range of mutation rates
3.8 Simulated non-neutral allele age distributions (θ = 0.01) - larger range of mutation rates
3.9 Simulated non-neutral allele age distributions (θ = 0.96) - larger range of mutation rates

4.1 Markov-Modulated Wright-Fisher model
4.2 Fluctuating population size in a reversible switching model: MMWF vs. HM
4.3 Instantaneous doubling of population size from N1 = 1000 to N2 = 2000, after an average of t generations: MMWF vs. HM
4.4 Instantaneous doubling of population size from N1 = 1000 to N2 = 2000, after an average of 5,000 generations in N1, with variable selection
4.5 Full allele frequency spectra after a non-equilibrium demography for neutral variants
4.6 Full allele frequency spectra after a non-equilibrium demography for deleterious variants
4.7 Changes in allele frequency distribution function starting at a single allele copy, over single generations
4.8 Example simulation of increasing population size
4.9 Coefficient of variation for the time to fixation
4.10 Ratio of probabilities of fixation between MMWF and HM
4.11 Full allele frequency spectrum when using different levels of approximation

5.1 Convergence to equilibrium for different selection coefficients
5.2 Time to convergence to equilibrium
5.3 Time to convergence for variable mutation rate and population size
5.4 Standing genetic variation increases the corresponding rate for the next substitution
5.5 Ratios of rates of substitution
5.6 Comparison of simulated versus calculated substitution rates under different selection strengths
5.7 Number of generations spent in each epoch prior to fixation in the terminal epoch
5.8 Example equilibrium distributions
5.9 Approach to equilibrium calculation comparison
5.10 Simulated number of extinction trials before fixation in RM and RM+SGV
5.11 Means and standard deviations of simulated times between substitutions
5.12 Rate deviation between evolutionary rates for a large parameter grid

6.1 Probability distribution of time to fixation in the Wright-Fisher model
6.2 Probability distribution of time to next substitution in the Wright-Fisher model
6.3 Probability distribution of time to fixation or extinction (neutral case), with different approximating population sizes
6.4 Probability distribution of time to fixation or extinction, with different approximating population sizes for a moderately beneficial variant
6.5 Probability distribution of time to fixation or extinction, with different approximating population sizes for a strongly beneficial variant

List of Tables

2.1 WFES performance with truncation

3.1 Parameters used throughout this study and their meanings
3.2 Comparison of calculated expected neutral allele age
3.3 Representative expected allele age and variance
3.4 Allele age calculation run time

4.1 Computation time for the AFS using different methods, as a function of population size
4.2 Computation time for the AFS using different methods, as a function of the epoch length
4.3 Computation time for the AFS using different approximating factors

6.1 WFES2 single population calculations
6.2 WFES2 MMWF calculations
6.3 WFES2 allele frequency calculations
6.4 WFES2 probability distribution calculations
6.5 First 20 moments of the distribution of time to substitution

Abstract

Population genetic models are fundamental to how we understand and infer the forces that patterned genome sequence variations in natural populations and across species. The most elemental model of population genetics is the discrete-time Wright-Fisher Markov model (WF), which mathematically describes the time evolution of genetic variation in idealized populations. WF is conveniently flexible and can account for a wide variety of important deterministic and stochastic forces, but it can be quite difficult to analyze. Often WF is studied by Monte Carlo methods, which can be time consuming and imprecise for rare events and random variables with broad distributions. Alternatively, WF can be studied mathematically using continuous-time diffusion approximations that were designed for convenient mathematics in an era that predated the wide availability of modern computers. In evolutionary genetics and molecular evolution, diffusion approaches remain in wide use, and are often deployed even when there is little compelling reason to do so (e.g., for models that lack closed-form solutions, requiring numerical integration that can be unstable and even misleading for unknown ranges of parameter choices). An important direction, now that we have access to both great data and expansive computational resources, is to evaluate the biological reasonableness of traditional simplifying assumptions, and to explore the consequences of relaxing those assumptions for how we understand fundamental evolutionary processes.

In this dissertation, I build a comprehensive set of models and computational methods for rapidly and directly analyzing Markov chain models in population genetics without making any of the traditional simplifying assumptions like infinite sites, weak mutation, and weak selection. Our interest in these approaches was initially born out of a desire to better understand and model the population genetics of molecular evolution. However, it is my hope that this body of work will help stimulate renewed interest in Markov chain methods in population genetics in general, and that it may pave the way for more realistic, mechanistic, and tractable models of molecular evolution.

Preface

This work was performed as part of the doctoral research program in the lab of Dr. A.P. Jason de Koning. Much of the work has been done in collaboration with Bianca de Sanctis. Both have contributed greatly to the work detailed here. The acknowledgements of the specific contributions can be found in each chapter.

Published work

1. Chapter 2 has been published as an application note in Oxford Bioinformatics, 2017, as Krukov et al. [1].

2. Chapter 3 has been published as a research article in Scientific Reports (Nature), 2017, as de Sanctis et al. [2].

Both of the papers have been published as open access.

Acknowledgments

I would like to thank my supervisor A.P. Jason de Koning, and the chief co-author Bianca de Sanctis. Without them, this work would not be possible. Through the highs and the lows, we have preserved our fascination and optimism about science. I will never forget the moments of excitement when we would put everything on hold, just to share a new idea.

To my committee members, Sam Yeaman and Nic Rodrigue, thank you for your support in the long and arduous process.

I would like to thank Sonja, who always encouraged me to keep going, even when it seemed futile. I owe everything to you. A big thanks to my Mom and Dad who are always proud of me.

I would also like to thank my department of Biochemistry and Molecular Biology, faculty, and staff, who always valued student success above all.

My special gratitude goes to the Alberta Innovates Technology Futures fund, which supported this and several other projects in the lab. A big thanks to the Queen Elizabeth II Graduate scholarship that has funded parts of this work.

I would also like to thank all the past and present members of the de Koning lab. Our family is one of a kind. Sincerely, thank you.

Chapter 1

Introduction

The Wright-Fisher model (WF) [3, 4] is central to both classical and modern population genetics. WF describes allele frequency dynamics in a finite population due to random and directional forces, such as selection or mutation. The model is conveniently easy to state mathematically, but is notoriously hard to analyze [5]. As a result, many theoretical advances have been made through approximations and judiciously simplified models.

We start with a review of the classical WF model, which describes a single panmictic population of constant size. The model was known to both Fisher and Wright, as they both derived results directly from it. However, it is interesting that neither wrote down the formulation explicitly [5]. The core of the model is its description of random genetic drift as a random sampling process over generations in time. This sets it apart from deterministic models of populations (see [6] for an accessible review of mathematical modelling in genetics). We then consider the applications in evolutionary genetics via the derivation of the expected substitution rate on a phylogenetic tree.

1.1 The Wright-Fisher model

A useful population model should describe the allele states in the population under random genetic drift, and be easily extendable to other forces of evolutionary interest. It should take into account the size of the population, as well as the effects of selection and mutation. Ideally, it should generalize to multiple loci in order to describe whole genomes. The model should inform our expectations about the amount of genetic variability in the population, and allow the derivation of other properties of interest, both short- and long-term. The Wright-Fisher model satisfies many of these requirements.

We begin by deriving the transition probability matrix for the Wright-Fisher model, following the formulation in [5]. We start with the basic drift-only model, and then add selection and mutation effects. We then show how the WF transition probability matrix can be used to derive the probability of fixation for a new allele, a core statistic in evolutionary biology. We then discuss the applications of this probability to the rate of substitutions on a phylogeny.

1.1.1 Transition probability matrix of the Wright-Fisher model

Consider a single locus with two alleles, wildtype a and derived A, within a population of size N of diploid organisms. Since every diploid organism carries 2 copies of its genome, the total number of genomes (chromosomes) within the population is 2N. Given the current generation with i copies of allele A, the next generation is produced with a binomial draw from the current one. The binomial distribution is appropriate here since the model has two states (a and A), and we are keeping track of each allele in the population. Mathematically, this gives the probability of transition from i to j copies of allele A within one generation:

$$P_{i,j} = \binom{2N}{j} (\psi_i)^{j} (1 - \psi_i)^{2N-j} \tag{1.1.1}$$

Above, $\binom{2N}{j}$ is the binomial coefficient, giving the number of ways to draw $j$ alleles from a sample size of $2N$. The term $\psi_i$ is the binomial sampling probability, which represents the Bernoulli success probability of each draw in the population.

This equation yields a system of binomial distributions, one for each i. This set constitutes the per-generation transition probability matrix from i to j copies of A. This matrix is of central importance in this work. Note that the transition probability matrix defines a Markov chain, as the count of A in the following generation depends only on the present generation. This feature will allow us to use standard Markov chain theory to describe the behaviour of the model.

In the simplest case, with only drift acting in the population, $\psi_i$ equals the current allele frequency of A: $\psi_i = \frac{i}{2N}$. The mean of our binomial distribution, or the expected number of copies of A in the next generation, is $2N\psi_i$, which is simply $i$ in this case. Thus, under drift alone, the expected change in allele number per generation is zero.
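The drift-only matrix can be checked numerically for a small population. The following is a minimal sketch, not the WFES implementation: the function name `drift_matrix` and the choice N = 10 are illustrative assumptions, and the dense double loop is only practical for small N.

```python
import math
import numpy as np

def drift_matrix(N):
    """Dense (2N+1) x (2N+1) matrix of equation (1.1.1) with psi_i = i / 2N.

    Hypothetical helper for illustration; only suitable for small N.
    """
    n = 2 * N  # number of gene copies in a diploid population of size N
    P = np.zeros((n + 1, n + 1))
    for i in range(n + 1):
        psi = i / n  # drift only: sampling probability is the current frequency
        for j in range(n + 1):
            P[i, j] = math.comb(n, j) * psi ** j * (1 - psi) ** (n - j)
    return P

P = drift_matrix(10)
row_sums = P.sum(axis=1)            # each row is a probability distribution
expected_next = P @ np.arange(21)   # expected copy number in the next generation
```

Each row should sum to one, and `expected_next[i]` should equal `i`, which is the statement that the expected change under drift alone is zero.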

1.1.2 Selection and mutation

The transition probabilities can be modified to account for selection. Selection acts on the organism level, not the allele level. In our model, an organism is identified by its genotype, so we need a way to translate between allele and genotype frequencies. A standard way to do this is the Hardy-Weinberg function. As before, i is the count of A in the current generation:

Genotype    Genotype frequency
AA          $(\frac{i}{2N})^2$
Aa          $2(\frac{i}{2N})(1 - \frac{i}{2N})$
aa          $(1 - \frac{i}{2N})^2$

Selection is the relative advantage of an individual with a particular genotype. In the WF model, selection will modify the probability of an individual to be sampled in the next generation. For example, under positive selection, the probability of sampling should increase. Since we are dealing with diploid individuals, we might want to consider dominance, which is the differential effect of carrying one or two derived alleles. One possible parameterization (which is conventional in the field, see [5]) is to assign a selection coefficient s to the derived allele A with respect to a neutral wildtype allele a:

Genotype    Fitness
AA          $1 + s$
Aa          $1 + sh$
aa          $1$

In the absence of selection, s = 0, all genotypes have a fitness of 1, making them equally likely to be sampled in the next generation. The coefficient h expresses the fraction of the selection effect that the heterozygote Aa experiences. With h = 1, the allele is dominant, such that both the heterozygote Aa and homozygote derived AA are equivalent. With h = 0, the allele is recessive (Aa and aa are equivalent).

Then $\psi_i$ should account for the contributions of selection, and be appropriately normalized. We multiply the Hardy-Weinberg frequencies of the genotypes carrying A by their respective fitness values:

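$$\psi_i = \frac{(1 + s)\left(\frac{i}{2N}\right)^2 + (1 + sh)\frac{i}{2N}\left(1 - \frac{i}{2N}\right)}{(1 + s)\left(\frac{i}{2N}\right)^2 + 2(1 + sh)\frac{i}{2N}\left(1 - \frac{i}{2N}\right) + \left(1 - \frac{i}{2N}\right)^2} \tag{1.1.2}$$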

Note that the equation above simplifies to $\frac{i}{2N}$ when s = 0, as desired. With s > 0, the expected per-generation allele frequency change is positive, so the frequency of A tends to increase under positive selection. Figure 1.1 shows the expected difference in the number of offspring per generation.

Mutation can be added in a similar manner, by modifying the Hardy-Weinberg frequencies. We can differentiate between two types of mutation, one from a to A (forward, v), and the other from A to a (backward, u). Forward mutation converts existing a to A, so v should multiply the current count of a. Backward mutation removes A by converting it to a, so 1 − u should multiply the current count of A. We thus arrive at our final version of $\psi_i$:

[Figure 1.1: line plot. x-axis: Number of copies (0 to 2000); y-axis: Expected change in the number of offspring; one line per value of 2Ns, from −2.0 to 2.0.]

Figure 1.1: Change in the number of copies per generation in the Wright-Fisher model with different selection coefficients. The change per generation is symmetric around neutrality (grey line).

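$$\psi_i = \frac{\left[(1 + s)\left(\frac{i}{2N}\right)^2 + (1 + sh)\left(\frac{i}{2N}\right)\left(1 - \frac{i}{2N}\right)\right](1 - u) + \left[(1 + sh)\left(\frac{i}{2N}\right)\left(1 - \frac{i}{2N}\right) + \left(1 - \frac{i}{2N}\right)^2\right] v}{(1 + s)\left(\frac{i}{2N}\right)^2 + 2(1 + sh)\left(\frac{i}{2N}\right)\left(1 - \frac{i}{2N}\right) + \left(1 - \frac{i}{2N}\right)^2} \tag{1.1.3}$$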

We now have a reasonably descriptive version of $\psi_i$ that we can use in the transition probability matrix of (1.1.1). The model describes the change of allele counts every generation and accounts for drift, selection, dominance, and bidirectional mutation.
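The sampling probability with selection, dominance, and mutation can be written as a short function. This is a sketch under one stated assumption: following the description of mutation in the text, the factor (1 − u) is taken to multiply the A-bearing terms and v the a-bearing terms. The function name `psi` and its default arguments are illustrative and do not come from WFES.

```python
def psi(i, N, s=0.0, h=0.5, u=0.0, v=0.0):
    """Binomial sampling probability of allele A given i current copies.

    Assumption: (1 - u) scales the A-bearing terms and v scales the
    a-bearing terms, as described for (1.1.3). Names are illustrative.
    """
    x = i / (2 * N)                      # current frequency of A
    hom_A = (1 + s) * x * x              # fitness-weighted AA frequency
    het = (1 + s * h) * x * (1 - x)      # one allele's share of the Aa class
    hom_a = (1 - x) ** 2                 # aa frequency (fitness 1)
    mean_fitness = hom_A + 2 * het + hom_a
    return ((hom_A + het) * (1 - u) + (het + hom_a) * v) / mean_fitness

neutral = psi(5, 10)                     # drift only: should equal 5 / 20
selected = psi(5, 10, s=0.1)             # positive selection raises psi
from_zero = psi(0, 10, v=1e-3)           # only forward mutation can create A
from_fixed = psi(20, 10, s=0.1, u=1e-3)  # only backward mutation can remove A
```

With s = u = v = 0 the function returns i/2N, and at the boundaries it returns v and 1 − u respectively, so mutation keeps the chain from being trapped at 0 or 2N copies.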

1.1.3 Probability of fixation

The statement of the probability matrix (1.1.1) and the inclusion of evolutionary forces in (1.1.3) is not yet useful, as it only describes the changes within one generation. We would like to expand this to derive long-term properties as well. The probability of fixation of a new allele in a population is a classical example of such a property, as it has many applications.

The classical method for finding the probability of fixation is via the diffusion approximation, which uses a continuous representation of the Wright-Fisher model. The derivation of this and many other important properties in the diffusion framework has been described in a series of papers by Kimura (see [7] for an accessible review).

Instead of using the diffusion approximation, we can take the more direct approach of analysing the Wright-Fisher Markov chain (1.1.1) itself. While more computationally intensive, this approach is more general and requires fewer assumptions about the model.

Assume that we are starting with a single unique copy of allele A in a population that is otherwise all a. The allele A will exist in a population for some finite number of generations, before either going to extinction (zero copies of A), or fixation (2N copies of A). Fixation is much less likely, as we start with a single copy of A. We would like to derive the probability of fixation of our new allele.

As pointed out previously, the Wright-Fisher model is a Markov chain, since the future states only depend on the present. Furthermore, since the derived allele A will eventually either fix or go to extinction, we can treat these states as absorbing. In other words, in this setting we only keep track of the allele frequency of A until it either reaches 2N or 0, and stop after that. In terms of the transition probability, this means that once we reach either of these states, we stay in them with probability 1.

The states of any absorbing Markov chain can be partitioned into two classes: absorbing and non-absorbing (transient). We can partition the transition probability matrix P (1.1.1) in the following way:

$$P = \begin{pmatrix} Q & R \\ 0 & I_2 \end{pmatrix} \tag{1.1.4}$$

Above, $Q$ includes the transient-to-transient transition probabilities, and $R$ the transient-to-absorbing probabilities. $I_2$ contains the absorbing-to-absorbing transition probabilities, which with the two absorbing states (0 and 2N) is a 2 × 2 identity matrix.

Every absorbing Markov chain has a fundamental matrix, $N_{i,j}$, that expresses the expected number of generations the model spends in state j given that it started in state i, before eventually absorbing. The matrix can be found from the inverse of $I - Q$:

$$N_{i,j} = \{N_{i,j} = E(t(j) \mid n_0 = i);\ i, j \in A\} \tag{1.1.5}$$

$$N = \sum_{k=0}^{\infty} Q^k = (I - Q)^{-1} \tag{1.1.6}$$

The interested reader is referred to [8, theorem 3.2.4] for the proof that the fundamental matrix indeed has this interpretation. This matrix provides a rich source of information about the long-term behaviour of the model.

We can use the fundamental matrix to find the probability of fixation of an allele, as we show presently. The expected number of generations spent in each state before absorption, given that we started with 1 copy, is given by the first row of the fundamental matrix ($N_{1,j}$). The second column of $R$ ($R_{i,2}$) corresponds to the probability of going from allele count i to fixation within one generation. The dot product of these two vectors is the time spent in each state weighted by the probability of fixation from each state, which is the probability of fixation:

$$\Pr\nolimits_{\mathrm{fix}} = \sum_{i} N_{1,i} R_{i,2} \tag{1.1.7}$$
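The fixation probability calculation can be reproduced for a small neutral population. The sketch below is an illustration rather than the WFES implementation: it uses dense NumPy arrays instead of sparse solvers, N = 10 is an arbitrary choice, and the variable names (`N_pop`, `fundamental`, `B`) are assumptions. `N_pop` is used for the population size to avoid a clash with the fundamental matrix.

```python
import math
import numpy as np

N_pop = 10          # diploid population size (illustrative)
n = 2 * N_pop       # number of gene copies

# Neutral transition matrix from (1.1.1) with psi_i = i / 2N.
P = np.array([[math.comb(n, j) * (i / n) ** j * (1 - i / n) ** (n - j)
               for j in range(n + 1)]
              for i in range(n + 1)])

# Partition as in (1.1.4): transient states are 1..2N-1, absorbing are 0 and 2N.
Q = P[1:n, 1:n]
R = P[1:n][:, [0, n]]          # column 0: extinction, column 1: fixation

# Fundamental matrix (I - Q)^{-1}, obtained by solving a linear system.
fundamental = np.linalg.solve(np.eye(n - 1) - Q, np.eye(n - 1))

# Dot product of the first row (start with 1 copy) with the fixation column of R.
p_fix = fundamental[0] @ R[:, 1]

# Absorption probabilities from every starting state.
B = fundamental @ R
```

For a neutral allele the result should match the classical value of 1/2N from a single copy, and more generally i/2N from i copies, which gives a convenient sanity check on the partitioning.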

The probability of extinction, being the alternative outcome, is one minus the probability of fixation: $\Pr_{\mathrm{ext}} = 1 - \Pr_{\mathrm{fix}}$. In general, the probability of either extinction or fixation starting in any state is given by the matrix $B = NR$. Once the fundamental matrix is known, many other properties of interest can be derived in a similar manner; these are explored in the rest of the thesis.

In this dissertation, I employ variations of this direct approach to the analysis of WF. Instead of approximating the underlying random process and making mathematically convenient assumptions about the strength of various forces, I compute an exact representation of the model by constructing full transition probability matrices for the un-approximated Markov chain model of interest, as demonstrated above. The WF transition probability matrices are sparse, and often very sparse at machine precision. Therefore we can apply high performance linear algebra techniques to rapidly derive the properties of the model. This approach allows many elegant and powerful techniques from the vast literature on Markov chain theory to be exploited. A wide array of models, methods, and computations have been developed throughout, which I have implemented in our software package, Wright-Fisher Exact Solver (WFES).

Our direct computational approach has two major advantages. First, we do not have to limit the effects of directional forces, such as mutation, and can explore the behavior of WF in non-classical parameter regimes. Second, we are free to construct extensions to the basic WF (by considering different forms of (1.1.1)), and derive their properties directly. Typically, this means we are free to modify the biological assumptions and forces considered in the model, and the computation of various properties, statistics, probabilities, and probability distributions will ‘just work’. This model-first approach is intended to encourage further development and exploration of more realistic models in both population genetics and molecular evolution.

1.2 Applications in evolutionary biology

The original motivation for this work was to expand the mutation-selection (MS) models of sequence evolution. Mutation-selection models have been increasingly popular in evolutionary biology and phylogenetics (see [9] for a review). These models were first introduced by Golding and Felsenstein [10], and further developed by Halpern and Bruno [11]. These models offer an attractive mechanistic description of the population-genetic process that produces fixed substitutions between lineages. In this section we focus on the applications of WFES to MS models in order to improve the modelling realism.

A desirable feature of the direct analysis framework is that all the population genetic parameters are considered in the context of the model. Typically, for a tractable mathematical treatment, random genetic drift, mutation, and selection are considered separately [5, 10, 11]. Instead, we can include all the relevant forces directly in the analysis, without changing the overall framework. We can use the same approach to treat a number of different models, which might not be possible in other settings.

The rate of substitution is a core property underlying MS models, as it dictates the expected number of differences observed between two sequences. The common way to derive the expected rate of substitution is by considering it as a biphasic process. The first phase is mutation, which introduces new variants into the population, and the second is the eventual fixation (or extinction) of the novel variant. This is what McCandlish and Stoltzfus [12] refer to as “origin-fixation” models. This approach greatly simplifies the expression for the rate of evolution, $K$, as the product of the population mutation rate ($2N\mu$) and the probability of fixation ($P_{\mathrm{fix}}(s, N)$):

$$K = 2N\mu P_{\mathrm{fix}}(s, N) \tag{1.2.1}$$

Note that in this formulation, it is ambiguous what the mutation rate µ refers to: it could be interpreted either as the genome-wide, or as a site-specific rate of mutation. The latter interpretation is applied in MS models, as we discuss below.
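As a quick worked case of (1.2.1), consider a neutral variant (s = 0). Under drift alone a new mutation present as a single copy fixes with probability 1/2N, so the population size cancels and the substitution rate reduces to the mutation rate:

```latex
K = 2N\mu \, P_{\mathrm{fix}}(0, N) = 2N\mu \cdot \frac{1}{2N} = \mu
```

This is only the neutral special case; with selection, $P_{\mathrm{fix}}(s, N)$ no longer cancels the factor of $2N$.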

Interpretation of µ

The rate of substitution described above can be interpreted in two different ways: in the infinite sites model, or in the single-site sequential substitution model. These models mainly differ in their interpretation of the mutation rate parameter µ. Below, we follow the exposition in [12]. In the infinite sites model (or equivalently here, the infinite alleles model), each mutation occurs at a new and previously unchanged locus. In this case, the mutation rate $\mu_{\mathrm{aggregate}}$ refers to the aggregate mutation rate across the entire set of loci per generation. Since only a fraction $P_{\mathrm{fix}}(s, N)$ of new mutations ($2N\mu_{\mathrm{aggregate}}$) reach fixation, the rate of substitution $K$ is the rate at which mutated sites go on to fix.

Alternatively, we can interpret µ as $\mu_{\mathrm{locus}}$, a per-locus per-generation mutation rate, and focus our attention on a single site. The single-site assumption is common in evolutionary biology, as sites are often treated independently (see however [13]). The model assumes that the mutation rate is small enough that any new mutation occurs at a currently monomorphic site (that could have undergone substitutions previously). With a sufficiently small mutation rate, the time spent waiting for a mutation is significantly longer than the time to reach fixation (conditional on fixation actually taking place) [10]. The standard simplification is to describe this process as a continuous-time Markov chain, where a site is switching between being fixed for different alleles, with no additional mutations happening in between. The rate of substitution from genotype x to y is then given by:

$$K_{x,y} = 2N\mu_{\mathrm{locus}} P_{\mathrm{fix}}(s_y - s_x, N) \tag{1.2.2}$$

where $s_x$ and $s_y$ are the selection coefficients of genotypes x and y, respectively. The time between fixations for such a continuous-time Markov chain is then exponentially distributed with rate $K$ (mean $1/K$): $T_{\mathrm{bfix}} \sim \mathrm{Exp}(K)$. This can be further generalized to substitutions between different genotypes, e.g. codons at a site, which is the basis of the MS models.

1.2.1 Mutation-selection limitations

The mutation-selection models are based on this single-site interpretation. The infinite sites assumption is inappropriate for comparison between different species, since loci with more than two states are commonly observed in across-species comparisons (which is the domain of application of MS models). There are limitations to the sequential substitutions model, however. First is the necessity for a small mutation rate. From diffusion theory, we know that the expected time to fixation of a neutral variant is 4N generations. Given 2Nµ mutations entering every generation, and the requirement that no new mutations happen in this period, the model apparently requires that 4N × 2Nµ = 8N²µ < 1. This imposes a strong upper bound on the mutation rate for which the sequential substitution model is applicable. With the work presented here, we can resolve this limitation in the Wright-Fisher context. The second limitation is that the model only considers two alleles per fixation cycle. While many alleles are possible, the model is sequentially biallelic. To be more general, the model requires consideration of multiple alleles. We do not directly address this limitation here, but we make several suggestions for future directions. The third limitation is the lack of consideration of interaction between sites. In the evolutionary setting, linkage between sites should not be a major factor. Since the divergence times under consideration are large, full recombination appears to be a reasonable assumption.
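Returning to the first limitation, a quick arithmetic illustration shows how restrictive the small-mutation-rate requirement can be. The population size N = 10⁴ used here is an assumed value chosen only for the arithmetic; it is not taken from the discussion above.

```latex
8N^2\mu < 1
\;\Longrightarrow\;
\mu < \frac{1}{8N^2} = \frac{1}{8 \times (10^{4})^{2}} = 1.25 \times 10^{-9}
```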

1.3 Direct computation of substitution rate

The Wright-Fisher Exact Solver (WFES) implements tools for direct analysis and solution of the properties of the Wright-Fisher model. On one side, we can solve for the classical properties such as the probability and time to fixation. The calculations are general, and work for any parameters, without the limitations of the diffusion approximation. Additionally, WFES can be used to calculate probability distributions that are not directly available from standard theory. Pertaining to the mutation-selection model, we can calculate the probability distribution of times between fixations without needing to assume weak mutation. Not only do we avoid the diffusion approximation, but we also do not require the origin-fixation formalism. Instead, we directly calculate the relevant distribution from the Wright-Fisher model.

The main premise of the origin-fixation approach is to specify the rate of evolution in terms of the parameters of the underlying population. In the mutation-selection (MS) setting, the rate of substitution determines the transition rates in a continuous-time Markov chain. The wait time for the next substitution is then exponentially distributed, with mean equal to the reciprocal of the rate of evolution. Thus the core quantity of interest is a probability distribution of times to the next substitution. This probability distribution can be derived directly from the Wright-Fisher model. Given an initial probability distribution over the allele frequencies f(0) (for example, a point mass at 1 copy for novel alleles), we want to calculate the probability of absorption in state k at a particular time t:

$$\Pr_k(t) = \sum_i R_{i,k}\,\bigl(f(0)P^{t-1}\bigr)_i = R_k \cdot \bigl(f(0)P^{t-1}\bigr) \tag{1.3.1}$$

This computation essentially iterates the matrix-vector multiplication. The probability of absorption is calculated by taking the current allele frequency distribution, and weighting it by the probability of going into an absorbing state in that generation. This calculation provides a distribution of times between fixations. Specifically, we want to model a population that is monomorphic for the ancestral allele, so the starting state is 0 copies of A. Each generation, there is a non-zero probability of absorbing into the fixation state from each transient state. The summation over all these probabilities gives the total probability of absorbing at generation t.

This approach gives us a straightforward way to calculate the distribution of time between substitutions directly from the Wright-Fisher model, sidestepping any assumptions about parameter strength. We also do not make the biphasic assumption, and treat mutation and allele segregation as parts of the same stochastic process. This gives us the opportunity to calculate results directly from the model. In many cases, this proves to be a fruitful approach. In this dissertation we explore a number of applications of the Wright-Fisher model to questions of evolutionary interest. We present models with numerous possible applications.
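The iteration in equation (1.3.1) can be sketched in a few lines of Python. This is a minimal illustration and not the WFES implementation: the function names, the tiny population of `m = 20` chromosomes, the neutral model with one-way mutation at rate `v`, and the choice to track only the transient block of the transition matrix are all assumptions made for the example.

```python
from math import comb

def wf_row(m, psi):
    """Binomial transition probabilities to 0..m copies, given sampling probability psi."""
    return [comb(m, j) * psi**j * (1 - psi)**(m - j) for j in range(m + 1)]

def absorption_time_distribution(m, v, t_max):
    """Return ([Pr(fixation exactly at generation t) for t = 1..t_max], unabsorbed mass).

    States are mutant copy numbers 0..m; only state m (fixation) is absorbing.
    Illustrative assumption: neutral alleles with one-way mutation at rate v.
    """
    # Transition rows for the transient states 0..m-1.
    rows = [wf_row(m, (i + (m - i) * v) / m) for i in range(m)]
    # Probability mass over transient states; start monomorphic (0 mutant copies).
    f = [0.0] * m
    f[0] = 1.0
    probs = []
    for _ in range(t_max):
        # Mass entering the absorbing state this generation: R_k . f
        probs.append(sum(f[i] * rows[i][m] for i in range(m)))
        # Advance the transient mass by one generation: f <- f Q
        f = [sum(f[i] * rows[i][j] for i in range(m)) for j in range(m)]
    return probs, sum(f)

probs, remaining = absorption_time_distribution(m=20, v=0.01, t_max=2000)
mean_time = sum((t + 1) * p for t, p in enumerate(probs))
print(f"absorbed mass: {sum(probs):.6f}, still transient: {remaining:.2e}")
print(f"mean time among absorbed paths (truncated at t_max): {mean_time:.1f}")
```

Each pass costs one vector-matrix product, so the full distribution up to `t_max` generations needs `t_max` such products and never forms a matrix power explicitly.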

Overview of the dissertation

Chapter 2 - Wright-Fisher Exact Solver

In chapter 2, we present the general approach to the (algebraically) exact computational analysis of the Wright-Fisher model, as implemented in Wright-Fisher Exact Solver (WFES). We demonstrate the ability to calculate the probabilities and expected times to absorption of alleles in a finite population, and present the derivation of these properties in the Markov chain context. We show some known limitations of the diffusion approximation, and demonstrate the robustness of WFES to standard assumptions. This material has been published in Oxford Bioinformatics [1].

Chapter 3 - Allele age

Chapter 3 examines moments of the probability distribution of the number of generations an allele existed in a population given some observed frequency. Here we calculate and examine the mean, variance, and higher moments of allele age in the WFES framework. The chapter demonstrates the applicability of the direct analysis methodology by applying it in a wide parameter range. We also describe new examples of stochastic slowdowns and other phenomena, where, for example, slightly deleterious alleles are shown to be older than neutral alleles observed at the same frequency. This work has been published in Scientific Reports (Nature) [2].

Chapter 4 - Markov-Modulated Wright-Fisher models

Chapter 4 describes an extension of the Wright-Fisher model to time-heterogeneous parameters, the Markov-modulated Wright-Fisher model (MMWF). By deriving a transition probability matrix for instantaneous parameter changes, we implement models of variable population size in WFES. We compare our model to standard diffusion results, and show new limitations of widely used approaches including the harmonic mean estimator of effective population size. We develop exact and approximate approaches for rapidly calculating allele frequency spectra (AFS).

Chapter 5 - Adaptation from standing genetic variation

In chapter 5 we consider two problems. First, on the basis of MMWF, we construct a model of the rate of adaptation from standing genetic variation, and quantify the speedup of fixations under this model. Second, we investigate what an appropriate distribution for standing genetic variation may be. Many studies assume that alleles exist at equilibrium frequencies. However, we demonstrate that the time required to reach the equilibrium distribution may be too long in many cases. As an alternative, the MMWF model explicitly includes the variation accumulated after a given average number of generations.

Chapter 6 - Wright-Fisher Exact Solver 2

In chapter 6, we describe a substantial rewrite and update to the Wright-Fisher Exact Solver, which implements a set of new models, calculations, and computational improvements. We calculate the rate of substitution, and describe the calculation of the full probability distribution of the time to next substitution directly from the WF model. We also address limitations of a widely used small population size approximation, and explore it as an alternative to performing calculations that explicitly include every allele.

Chapter 2

Wright-Fisher Exact Solver (WFES): Scalable analysis of population genetic models without simulation or diffusion theory

This work has been published as an application note in Oxford Bioinformatics, as [1].

Contributions

BS and JdK developed the method, IK implemented the method, IK and JdK analyzed the data, and IK, BS, and JdK wrote the paper.

Abstract

Motivation: The simplifying assumptions that are used widely in theoretical population genetics may not always be appropriate for empirical population genetics. General computational approaches that do not require the assumptions of classical theory are therefore quite desirable. One such general approach is provided by the theory of absorbing Markov chains, which can be used to obtain exact results by directly analyzing population genetic Markov models, such as the classic bi-allelic Wright-Fisher model. Although these approaches are sometimes used, they are usually forgone in favor of simulation methods, due to the perception that they are too computationally burdensome. Here we show that, surprisingly, direct analysis of virtually any Markov chain model in population genetics can be made quite efficient by exploiting transition matrix sparsity and by solving restricted systems of linear equations, allowing a wide variety of exact calculations to be easily and rapidly made on modern workstation computers.

Results: We introduce Wright-Fisher Exact Solver (WFES), a fast and scalable method for direct analysis of Markov chain models in population genetics. WFES can rapidly solve for both long-term and transient behaviors including absorption probabilities, expected absorption times, sojourn times, expected allele age and variance, and others. Our implementation requires only seconds to minutes of runtime on modern workstations and scales to biological population sizes ranging from humans to model organisms.

2.1 Introduction

Diffusion approximations to the underlying Markov chain models of population genetics have been central to theoretical population genetics since the pioneering work of Kimura [14]. One reason for the success of diffusion approaches has been their ability to enable closed-form solutions to be found under simple models. This ability comes at the cost of typically needing to assume weak mutation and weak selection, which may not always be appropriate. Furthermore, when models aren’t simple enough to allow closed-form solutions to be found, as is the case with most models of practical interest, numerical integration must be used to approximate diffusion solutions. This approach of approximating an approximation is indirect, can be problematic at the extremes of parameter ranges [15], and still requires a set of standard assumptions like weak selection that lack generality. In these cases, the ability to work directly with the original Markov models is advantageous.

Methods for the direct analysis of Markov chain models, such as the classical Wright-Fisher model, are well known [5] and are often used in the development of new population genetic theory. However, the utility of direct matrix methods has typically been limited by the computational difficulty of working with very large transition matrices, whose size grows quadratically with the effective population size. As a result, it is not uncommon for direct approaches to be applied only to population sizes of 100 to 200 (e.g. [16]).

For direct Markov chain methods, the key computation is typically the determination of the “fundamental matrix” of the absorbing Markov chain [8], which requires a costly matrix inverse computation. We noticed that for most applications this full computation is unnecessary if the number of originating copies of the mutant allele is known. When this is so, only one row of the matrix inverse is needed, which can be obtained by solving a much simpler linear system. Furthermore, because the transition matrices for most Wright-Fisher models are very sparse, sparsity can be exploited to save both computation time and memory.

WFES is our implementation of these and other ideas.

2.2 Results

2.2.1 Implementation

We developed a rapid, parallel sparse linear algebra approach in C for the direct analysis of discrete-time Markov models, Wright-Fisher Exact Solver (WFES). (Note: this version has since been superseded by WFES2.) The implementation currently performs exact computation of: conditional probabilities of absorption (fixation, extinction); expected times to absorption (overall, fixation, or extinction); sojourn times (overall, or conditional); and the expected age of an allele and its variance [17]. Other quantities will be added in later versions.

2.2.2 Evaluation

To validate our implementation, we performed a series of comparisons of fixation probabilities to values estimated from forward simulation and diffusion theory under a simple Wright-Fisher model without mutation or dominance (see Supplemental Results for details). WFES results showed precise correspondence with high-replicate simulation averages (n = 10,000 fixations) over the entire parameter range, while diffusion approximations clearly became biased with strong selection (s > 0.1; Fig. 2.2). The relative error of diffusion approximations is shown in Figure 2.1, which illustrates the expected bias of standard fixation probability calculations when selection is strong. This expected result is presented both as a validation and as a demonstration of the generality of WFES, which produces exact results under non-classical parameter ranges.

Figure 2.1: Diffusion theory bias for strongly selected alleles in a population of Ne = 10,000.

2.2.3 Discussion

WFES has several advantages over simulation, diffusion theory, and other approximate methods. First, it is generally applicable to virtually all Markov chain models in population genetics and can accommodate dominance, two-way mutation, strong selection, and other forces without additional computational cost. Second, it avoids the indeterminacies of numerical integration methods, which arise when computing fundamental quantities with diffusion theory. Third, it is extensible to calculate expectations and variances of various properties of the Wright-Fisher model, such as allele age [17]. Fourth, our approach permits exact results for population sizes up to about 30,000 on typical workstation computers. For larger population sizes, truncation of the near-zero entries of the transition matrix (Table 2.1) can yield results within desired precision for population sizes beyond 100,000 and still requires only a couple of minutes of wall-clock time. While very large population sizes on the order of 1,000,000 are theoretically possible using this approach, they exceed the memory resources of typical computers at this time. For population sizes in the computable range, WFES produces exact results typically in far less time than running high-replicate simulations. The code is freely available and can be easily modified to implement new calculations.

2.3 Methods

To enable scalable computation with population genetic Markov models we developed several approaches to make the computations feasible, as described in the subsequent sections. First, we calculate the relevant Wright-Fisher transition matrix with a recursive algorithm for rapidly computing whole rows, and a row-parallel implementation (section 2.3.4). Second, we solve restricted linear systems (section 2.3.2) using LU decomposition followed by back substitution with routines that exploit sparsity and parallelism. By far the most time-consuming step is the LU decomposition itself. Given the LU decomposition of the relevant sparse matrix, the remaining computations take only seconds even for population sizes around Ne = 100,000. WFES is written in C and is designed to be fast, scalable, and easy to modify. Our implementation exploits routines in the Intel MKL PARDISO [18, 19, 20] library, a state-of-the-art linear solver commonly used for high-performance computing applications. For convenience, our distribution (https://github.com/dekoning-lab/wfes) includes the freely-distributable libraries needed to compile and run the program.

2.3.1 Finite absorbing Markov chain theory

In this section, we describe several well known results from the theory of absorbing Markov chains without giving explicit proofs. These can generally be found in [8], alongside examples and further reading.

Given the transition matrix of any Wright-Fisher model, P, having two absorbing states (for extinction and fixation) and an effective population size of Ne, we first re-order the rows and columns without changing the actual entries. Following standard theory [8], this re-ordering groups all of the transient states in their original suborder, followed by all of the absorbing states in their original suborder, so that the matrix is represented as

$$P = \begin{pmatrix} Q & R \\ 0 & I_2 \end{pmatrix}, \tag{2.3.1}$$

where Q is a (2N − 1) × (2N − 1) matrix of transient-to-transient state transitions, R is a nonzero (2N − 1) × 2 matrix of transient-to-absorbing state transitions, $I_2$ is the 2 × 2 identity matrix reflecting absorbing-to-absorbing state transitions, and 0 is a 2 × (2N − 1) matrix where every entry is 0, reflecting the absorbing-to-transient state transitions. Then

$$N = \sum_{k=0}^{\infty} Q^k = (I - Q)^{-1} \tag{2.3.2}$$

is called the fundamental matrix of the Markov chain. Each entry $N_{ij}$ of the fundamental matrix gives the expected number of times the chain is in state j, given that the chain started in transient state i. The variance on this is the (i, j)th entry of

$$N_2 = N(2N_{dg} - I) - N_{sq} \tag{2.3.3}$$

where $N_{dg}$ is a diagonal matrix with the same diagonal as N, and $N_{sq}$ is the Hadamard product of N with itself, or entry-wise squared. Summing the ith row of N will give the expected time until absorption, given that the chain started in transient state i (and unconditional on any specific absorbing state). Equivalently, we can take the ith entry of the vector

$$\mathbf{t} = N\mathbf{1} \tag{2.3.4}$$

In biological terms, this is the expected time until either fixation or extinction occurs. The variance on this is the ith entry of

$$\mathbf{v} = (2N - I_t)\,\mathbf{t} - \mathbf{t}_{sq} \tag{2.3.5}$$

where $\mathbf{t}_{sq}$ is the Hadamard product of $\mathbf{t}$ with itself. The probability of absorbing in state j having started in state i is the (i, j)th entry of the matrix

B = NR (2.3.6)

In our case, B will be a (2N − 1) × 2 matrix, where the first and second columns correspond to probabilities of extinction and fixation respectively. The expected number of times the chain is in state j, conditional on starting in transient state i and absorbing in state k is

$$E_{ik}(j) = N_{ij}\,\frac{B_{jk}}{B_{ik}} \tag{2.3.7}$$

Then the expected number of steps before absorption, conditional on starting in transient state i and absorbing in state k is

$$\sum_{j=1}^{2N-1} E_{ik}(j) \tag{2.3.8}$$

For completeness, we also give this same result in matrix form. Define $D_k$ as a diagonal matrix with diagonal entries $b_{jk}$, for a fixed absorbing state k. Then the expected number of steps before absorption having started in transient state i and conditional on absorbing in state k is the ith entry of

$$\tilde{\mathbf{t}} = D_k^{-1} N D_k \mathbf{1} \tag{2.3.9}$$

where 1 again is a vector of 1s. The variance on this is the ith entry of

$$\tilde{\mathbf{v}} = (2 D_k^{-1} N D_k - I_t)\,\tilde{\mathbf{t}} - \tilde{\mathbf{t}}_{sq} \tag{2.3.10}$$

Higher moments for the quantities given are quite complicated. Closed form expressions for higher-order moments were recently derived in [21]. Many other quantities can be calculated using similar approaches. For example, we recently used this approach to develop an exact method for computing the expected age of an allele and its variance, which is implemented in WFES and is fully described in [2]. As other quantities are added in future versions of WFES they will be fully documented.
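The quantities in equations (2.3.2) to (2.3.8) can be checked numerically on a very small chain. The sketch below is illustrative only and is not how WFES computes them: it assumes a neutral model without mutation, writes `m` for the number of chromosomes (2N in the text), and uses a dense Gauss-Jordan inverse written for this example, which is only sensible for tiny matrices.

```python
from math import comb

def wf_matrix(m):
    """Neutral Wright-Fisher matrix on 0..m mutant copies (no mutation, no selection)."""
    return [[comb(m, j) * (i / m)**j * (1 - i / m)**(m - j) for j in range(m + 1)]
            for i in range(m + 1)]

def inverse(a):
    """Dense Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(row) + [float(r == c) for c in range(n)] for r, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [x / d for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

m = 10
P = wf_matrix(m)
n = m - 1                                   # number of transient states (1..m-1 copies)
Q = [[P[i + 1][j + 1] for j in range(n)] for i in range(n)]
R = [[P[i + 1][0], P[i + 1][m]] for i in range(n)]   # columns: extinction, fixation

N = inverse([[float(i == j) - Q[i][j] for j in range(n)] for i in range(n)])  # (2.3.2)
t = [sum(row) for row in N]                                                   # (2.3.4)
B = [[sum(N[i][j] * R[j][k] for j in range(n)) for k in range(2)]
     for i in range(n)]                                                       # (2.3.6)
# Expected time to fixation, given fixation, from a single copy: (2.3.7)-(2.3.8)
t_fix = sum(N[0][j] * B[j][1] / B[0][1] for j in range(n))

print(f"P(fix | 1 copy) = {B[0][1]:.4f}")
print(f"E[time to absorption | 1 copy] = {t[0]:.3f}")
print(f"E[time to fixation | fixation, 1 copy] = {t_fix:.3f}")
```

Index 0 of `N`, `t`, and `B` corresponds to starting with one mutant copy, because the absorbing state 0 has been removed from the transient block.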

2.3.2 Rapid solution of restricted linear systems

Significant computational savings occur when assuming that we know i. For example, it is commonly assumed that i = 1 or, equivalently, that the mutant enters the population as a single copy. In this case, the above calculations require only the first row of N. This simplifies our computation considerably, because instead of computing an entire matrix inverse, we can instead just solve the linear system

$$(I - Q)^{T} N_1 = I_1 \tag{2.3.11}$$

for $N_1$, where $I_1$ is the first column of the identity matrix. We solve this system by first obtaining a decomposition of $(I - Q)^T$. We do have to calculate the entire B matrix in order to obtain conditional times to absorption. Fortunately, B only has two columns. We solve for the first by considering another system of linear equations

(I − Q)B1 = R1 (2.3.12)

where $B_1$ is the first column of B and $R_1$ is the first column of R. Given that the decomposition of $(I - Q)^T$ was obtained in solving for $N_1$, the solution to this system is trivial. Since the Wright-Fisher model only contains two absorbing states, we can now compute

B2 = 1 − B1 (2.3.13)

where 1 is a vector of 1s and B2 is the second column of B. Then B1,1 is the probability of extinction and B1,2 is the probability of fixation given that we started with a single copy. We can now use equation (2.3.7) to compute the expected time to fixation, given that fixation occurs, and likewise for extinction.
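The restricted solves in equations (2.3.11) to (2.3.13) can be sketched with a single LU factorization that is reused for both systems. This is a dense, pure-Python illustration under stated assumptions, not the sparse PARDISO-based routine used by WFES: `lu_factor` and `lu_solve` are helper names invented here, the model is neutral without mutation, `m` is the number of chromosomes, and the sketch factors I − Q once and solves the transposed system from the same factors (rather than factoring the transpose).

```python
from math import comb

def lu_factor(a):
    """In-place style Doolittle LU with partial pivoting: P A = L U."""
    n = len(a)
    lu = [list(row) for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for r in range(k + 1, n):
            lu[r][k] /= lu[k][k]
            for c in range(k + 1, n):
                lu[r][c] -= lu[r][k] * lu[k][c]
    return lu, perm

def lu_solve(lu, perm, b, transpose=False):
    """Solve A x = b (or A^T x = b) by substitution, reusing the factors of A."""
    n = len(lu)
    x = [0.0] * n
    if not transpose:
        for k in range(n):                      # L y = P b
            x[k] = b[perm[k]] - sum(lu[k][j] * x[j] for j in range(k))
        for k in reversed(range(n)):            # U x = y
            x[k] = (x[k] - sum(lu[k][j] * x[j] for j in range(k + 1, n))) / lu[k][k]
        return x
    for k in range(n):                          # U^T z = b
        x[k] = (b[k] - sum(lu[j][k] * x[j] for j in range(k))) / lu[k][k]
    for k in reversed(range(n)):                # L^T w = z
        x[k] -= sum(lu[j][k] * x[j] for j in range(k + 1, n))
    out = [0.0] * n
    for k in range(n):                          # undo the row permutation
        out[perm[k]] = x[k]
    return out

m = 20
P = [[comb(m, j) * (i / m)**j * (1 - i / m)**(m - j) for j in range(m + 1)]
     for i in range(m + 1)]
n = m - 1
A = [[float(i == j) - P[i + 1][j + 1] for j in range(n)] for i in range(n)]  # I - Q
R1 = [P[i + 1][0] for i in range(n)]            # transient -> extinction

lu, perm = lu_factor(A)                         # the expensive step, done once
e1 = [1.0] + [0.0] * (n - 1)
N1 = lu_solve(lu, perm, e1, transpose=True)     # first row of N, eq. (2.3.11)
B1 = lu_solve(lu, perm, R1)                     # extinction column, eq. (2.3.12)
B2 = [1.0 - b for b in B1]                      # fixation column, eq. (2.3.13)

t_abs = sum(N1)                                               # from one copy
t_fix = sum(N1[j] * B2[j] for j in range(n)) / B2[0]          # via eq. (2.3.7)
print(f"P(fix) = {B2[0]:.4f}, E[T_abs] = {t_abs:.3f}, E[T_fix | fix] = {t_fix:.3f}")
```

The point of the design is that no inverse is ever formed: one factorization plus two pairs of triangular substitutions yields every quantity needed for a mutant starting as a single copy.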

2.3.3 Parameterization of the Wright-Fisher model

The Wright-Fisher model describes the time-evolution of a bi-allelic locus in a population of fixed size with Ne individuals and non-overlapping generations. In its standard form, the model describes the number of mutant alleles in the next generation as a binomial draw from the number in the current generation. Assuming a diploid population with 2Ne chromosomes,

$$P_{i,j} = \binom{2N_e}{j}\,\psi_i^{\,j}\,(1 - \psi_i)^{2N_e - j}, \tag{2.3.14}$$

where ψi is the probability of being a mutant allele in the next generation given i alleles in the current generation, and Pi,j expresses the chance of the population moving from i to j copies of a mutant allele within one generation. Following Ewens [5], we parameterize the fitness of diploid individuals as follows, where a is the wild-type state and A the mutant:

Genotype    Fitness
AA          1 + s
Aa          1 + sh
aa          1

Given this fitness model and bi-directional mutation, the corresponding transition probability matrix can be expressed using the above formula for $P_{i,j}$ and

$$\psi_i = \frac{\bigl[(1+s)\,i^2 + (1+sh)\,i(2N-i)\bigr](1-u) + \bigl[(1+sh)\,i(2N-i) + (2N-i)^2\bigr]v}{(1+s)\,i^2 + 2(1+sh)\,i(2N-i) + (2N-i)^2} \tag{2.3.15}$$

where v is the forward mutation rate, u the backward mutation rate, s the selection coefficient, and h the dominance coefficient. Note: a correction has been added to the equation above compared to the published version. i and 2N − i should represent allele counts of the derived and ancestral alleles respectively.
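Equation (2.3.15) translates directly into a small function. The sketch below assumes that the numerator groups as [mutant-allele terms](1 − u) + [wild-type-allele terms]v, which is the reading consistent with the haploid form in this section; the function name and argument names are chosen here for illustration.

```python
def psi_diploid(i, n, s=0.0, h=0.5, u=0.0, v=0.0):
    """Sampling probability psi_i for i mutant copies among 2n chromosomes.

    s: selection coefficient, h: dominance coefficient,
    v: forward mutation rate (a -> A), u: backward mutation rate (A -> a).
    """
    a = i              # copies of the mutant (derived) allele A
    b = 2 * n - i      # copies of the wild-type (ancestral) allele a
    w_mut = (1 + s) * a * a + (1 + s * h) * a * b   # fitness-weighted A contribution
    w_wt = (1 + s * h) * a * b + b * b              # fitness-weighted a contribution
    return (w_mut * (1 - u) + w_wt * v) / (w_mut + w_wt)

print(psi_diploid(5, 10))            # neutral, no mutation: equals i / (2n)
print(psi_diploid(5, 10, s=0.1))     # beneficial allele: above i / (2n)
print(psi_diploid(0, 10, v=1e-3))    # monomorphic wild type: equals v
```

Two boundary checks follow from the formula: with no mutant copies the sampling probability reduces to v, and with the mutant fixed it reduces to 1 − u.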

Haploid model

For reference, we also implemented an analogous haploid model (with Ne chromosomes), including selection and bi-directional mutation:

$$P_{i,j} = \binom{N_e}{j}\,(\psi'_i)^{j}\,(1 - \psi'_i)^{N_e - j}, \tag{2.3.16}$$

$$\psi'_i = \frac{i(s+1)(1-u) + (N_e - i)\,v}{i(s+1) + N_e - i} \tag{2.3.17}$$

2.3.4 Rapid calculation of the transition matrix

To efficiently calculate the Wright-Fisher transition probability matrix, we note the following recurrence relation for each row based on equation 2.3.14

$$p_{i,0} = (1 - \psi_i)^{2N_e} \tag{2.3.18}$$

$$p_{i,j} = p_{i,j-1} \cdot \frac{2N_e - j + 1}{j} \cdot c_i \tag{2.3.19}$$

where $\psi_i$ and $c_i = \psi_i/(1 - \psi_i)$ need only be computed once for each row. Since each row of the matrix is independent of the others, rows may be calculated in parallel and merged into a sparse matrix format relatively easily.
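The row recurrence of equations (2.3.18) and (2.3.19) can be sketched as follows. This is an illustrative Python version, not the C routine in WFES; `m` stands for the number of chromosomes (2Ne in the text), and the function assumes 0 ≤ psi < 1, since the ratio c is undefined at psi = 1.

```python
from math import comb

def wf_row_recursive(m, psi):
    """One transition-matrix row built from p_0 = (1 - psi)^m and the ratio recurrence."""
    c = psi / (1.0 - psi)              # computed once per row
    row = [(1.0 - psi) ** m]           # p_{i,0}
    for j in range(1, m + 1):
        # p_{i,j} = p_{i,j-1} * (m - j + 1) / j * c
        row.append(row[-1] * (m - j + 1) / j * c)
    return row

# Compare against the direct binomial formula on a small example.
m, psi = 50, 0.3
direct = [comb(m, j) * psi**j * (1 - psi)**(m - j) for j in range(m + 1)]
recursive = wf_row_recursive(m, psi)
print(max(abs(a - b) for a, b in zip(recursive, direct)))
```

The recurrence replaces a binomial coefficient and two powers per entry with two multiplications and a division. For very large m the starting value (1 − psi)^m can underflow in double precision, so a production version would likely need to start from the mode of the row or work in log space; that refinement is not shown here.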

2.4 Additional Results

2.4.1 Comparison of Exact and Approximate Results

Supplementary Figure 1 shows how the probability of fixation changes under strong selection for the exact method (WFES), forward simulation (WF Monte Carlo), and under several diffusion approximations due to Kimura. In this figure, as in the main text, “Kimura diffusion” refers to Kimura’s analytic probability of fixation [22] ignoring mutation and dominance in a diploid population:

$$P_{\text{Fix}}^{(1)} = \frac{1 - e^{-s}}{1 - e^{-2N_e s}} \tag{2.4.1}$$

assuming that the initial frequency of the mutant allele is p = 1/(2Ne), i.e. a single copy. Similarly, “Kimura diffusion (weak selection)” refers to the usual further approximation of this result assuming small s:

$$P_{\text{Fix}}^{(2)} \approx \frac{s}{1 - e^{-2N_e s}} \tag{2.4.2}$$

Note that the form of both of these equations is based upon the parameterization of the Wright-Fisher model described above (Supplementary Methods section 1.3), which differs slightly from the form used by Kimura.
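The two diffusion formulas (2.4.1) and (2.4.2) are easy to evaluate side by side. The sketch below is an illustration with function names chosen here; it uses `math.expm1` for accuracy at small s and treats s = 0 as the neutral limit 1/(2Ne), which is an added convention rather than something stated above.

```python
from math import expm1

def pfix_kimura(s, n_e):
    """Equation (2.4.1): (1 - exp(-s)) / (1 - exp(-2 * n_e * s)), single initial copy."""
    if s == 0.0:
        return 1.0 / (2.0 * n_e)       # neutral limit
    return expm1(-s) / expm1(-2.0 * n_e * s)

def pfix_kimura_weak(s, n_e):
    """Equation (2.4.2): s / (1 - exp(-2 * n_e * s)), the small-s approximation."""
    if s == 0.0:
        return 1.0 / (2.0 * n_e)       # neutral limit
    return -s / expm1(-2.0 * n_e * s)

n_e = 10_000
for s in (0.001, 0.01, 0.1, 0.5):
    exact_diffusion = pfix_kimura(s, n_e)
    weak = pfix_kimura_weak(s, n_e)
    print(f"s={s}: (2.4.1)={exact_diffusion:.5f}  (2.4.2)={weak:.5f}")
```

For small s the two expressions agree closely, while for large s the weak-selection form approaches s itself and the first form approaches 1 − e^(−s), so the gap between them widens as selection strengthens.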

Figure 2.2: Probability of fixation for strongly selected alleles in a population of Ne = 10,000. (Horizontal axis: population-scaled selection coefficient; vertical axis: probability of fixation. Series: Kimura diffusion; Kimura diffusion (weak selection); WF Monte Carlo; WF exact (WFES).)

2.4.2 Effect of Truncation

Each row of the Wright-Fisher matrix is essentially a binomial distribution. By cutting off the long near-zero tails of the distribution, it is possible to increase the sparsity of the system (Figure 2.3), thus lowering both the CPU and RAM requirements.

Figure 2.3: A small N = 10 WF transition matrix. The color of each cell represents the probability of transition from i to j copies within a single generation. Note the sparsity of the matrix.

We investigated the effects of different threshold values (ε) on the accuracy of the solver in Supplementary Table 1. A threshold of 1e-20 appears to provide good accuracy and performance for reasonably large population sizes. We note that for larger population sizes (i.e. above N = 100,000), the threshold value will have to be lowered to obtain exact solutions (within machine precision) on most current workstation computers.
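The effect of thresholding on a single row can be illustrated as follows. This is a sketch under stated assumptions rather than the WFES routine: the row is a plain binomial computed in log space to avoid overflow, entries below the threshold are simply dropped without renormalization (whether WFES renormalizes is not stated here), and the values m = 2000 and psi = 0.1 are arbitrary.

```python
from math import exp, lgamma, log

def binom_row(m, psi):
    """Binomial(m, psi) probabilities for j = 0..m, computed via log-gamma."""
    lp, lq = log(psi), log(1.0 - psi)
    return [exp(lgamma(m + 1) - lgamma(j + 1) - lgamma(m - j + 1)
                + j * lp + (m - j) * lq)
            for j in range(m + 1)]

def truncate(row, eps):
    """Keep entries >= eps as (column, value) pairs, i.e. one sparse row."""
    return [(j, p) for j, p in enumerate(row) if p >= eps]

m = 2000
row = binom_row(m, 0.1)
kept = truncate(row, 1e-20)
discarded = sum(p for p in row if p < 1e-20)
print(f"{len(kept)} of {m + 1} entries kept; discarded mass = {discarded:.3e}")
```

Because the binomial tails decay very quickly, only a narrow band of entries around the mean survives the cut, while the probability mass that is thrown away stays far below double-precision resolution.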

Table 2.1: WFES performance with truncation, Ne = 50,000. Benchmarked on a 16-core Intel Xeon CPU.

Epsilon    Relative error, %    Memory usage, GB    Runtime, s
1e-25      0*                   27.8                122.268
1e-20      0*                   24.5                110.040
1e-15      0*                   20.7                102.372
1e-10      0.02                 16.1                80.684
1e-09      69.90                12.5                75.644

*Zero at machine precision.

Chapter 3

Revisiting Allele Age: The Joint Effects of Selection, Dominance and Mutation using an Exact Markov Chain Approach

This work has been published as a research article in Scientific Reports (Nature), as [2].

Contributions

JdK designed the research, BS formulated the methodology, JdK implemented the methodol- ogy and performed the analyses, IK implemented the simulation methodology and performed analyses, and JdK and BS wrote the manuscript. All authors reviewed the manuscript.

Abstract

Determination of the age of an allele based on its population frequency is a well-studied problem in theoretical population genetics, for which a variety of approximations have been proposed over four decades. In this report, we present a new result that allows the expectation and variance of allele age to be easily computed exactly for any finite absorbing Markov chain model in a matter of seconds. This approach exploits modern sparse linear algebra techniques, implicitly integrates over all sample paths, requires neither reversibility nor the infinite sites assumption, and is rapidly computable on most workstation computers for population sizes up to around Ne = 100,000. Using this computational population genetics approach, we demonstrate violations of classical results on allele age and show direct support for recently identified stochastic slowdowns (i.e., “selective strolls”), wherein weakly selected rare alleles are expected to be older than neutral alleles observed at the same frequency. Because our approach is general with respect to underlying modelling assumptions, we were also able to explore allele age under non-classical assumptions including bidirectional mutation and large population-scaled mutation rates (up to θ ≈ 1) relevant to many viruses, pathogens, hyperdiverse eukaryotes, and perhaps other groups. In the most extreme case we studied, we identify a strong stochastic slowdown for weakly deleterious, rare, recessive alleles, where expected extinction times are over 67% longer than for neutral alleles. We also note a strong age imbalance with respect to selection, such that rare deleterious alleles are expected to be substantially older than advantageous alleles observed at the same frequency. These results highlight the generality, feasibility, and under-appreciated utility of computational methods for the direct analysis of Markov chain models in population genetics.

3.1 Introduction

Allele age is generally defined as the duration of time a mutant allele has been segregating in a population. The problem of calculating the expected age of an allele given its current population frequency is an important problem in population genomics (e.g., [23]) with a long history of theoretical investigations (e.g. [24, 25, 26, 27, 28]; reviewed in [29]). One reason that allele age remains an important problem is that the effects of selection and allele age can be highly confounded in terms of their influence on population frequency. That is, an allele may be at low frequency because it is deleterious or simply because it is young. Inferences about the fitness effects of segregating polymorphisms should therefore make some consideration of allele age, either explicitly or implicitly. Methods for inferring fitness impacts of segregating polymorphisms based on their ages have also been proposed [30].

The first theoretical analysis of allele age was developed by Kimura and Ohta [31] using a continuous-time diffusion approximation to the age of a neutral allele in a finite population. Later work added consideration of selection [24], yielding the well-known result that allele age is expected to be symmetric with respect to the direction of selection, and that neutral alleles are expected to be older than selected alleles observed at the same frequency (the “Maruyama effect” hereafter). Recently, an interesting exception to these classical results has been pointed out [32, 33, 34]. Mafessoni et al. [34] showed that weakly selected rare alleles are expected to be about 5% older than neutral alleles observed at the same frequency, when heterozygote fitness is non-additive. This phenomenon appears to be an example of a more general behaviour recently termed ‘stochastic slowdown’ [32], where weak selection counter-intuitively prolongs, rather than shortens, the average time to absorption. It is important to understand the generality of these findings, since, as Mafessoni et al. [34] point out, many new mutations arising in a population are expected to be recessive and weakly deleterious, and it is conceivable that this slowdown effect could thereby mislead attempts to make inferences about .
Previous investigations of allele age, and classical approaches in population genetics more generally, have required that mutation rates are assumed to be so slow that no additional mutations can occur during the segregation of an initial variant (implying that the population-scaled mutation rate, θ, is very small or ≈ 0). However, cases where this assumption is violated in nature are increasingly being reported. While in most eukaryotes, the population-scaled mutation rate, θ, is ≪ 0.05, several examples of so-called hyperdiverse eukaryotes are known with θ between 0.05 and 0.15 [35]. In addition, θ in many organisms including viruses and pathogens can be much larger and can even exceed 1. For example, θ in HIV-1 has been estimated to be between 10 and 369 [36], while in the malaria parasite Plasmodium, a protist, θ has been recently estimated to be between 3.3 and 10.6 [37]. Other arguments that classical assumptions about θ may be violated in nature have also been recently put forward. For example, Messer and Petrov [38] have highlighted that most known cases of molecular adaptation across diverse organisms show signatures of soft selective sweeps (but see [39]), where adaptive alleles have multiple origins either by recurrent mutation or migration. These findings are potentially unexpected if evolution is strongly mutation-limited and may indicate that the effective population-scaled mutation rate is underestimated in many cases and/or that adaptation may tend to occur during periods of episodically large population size (and thus, high θ) [40].

We therefore decided to revisit the problem of calculating allele age based on population frequency under non-classical assumptions, and in particular to examine the impact of large values of θ on the expected age of an allele. For beneficial variants, the values of θ that we consider are expected to produce adaptive fixations that may have either multiple mutational origins or single origins [41].

To study the effects of non-classical parameter ranges on allele age, we develop a new exact approach capable of rapidly computing moments of the allele age distribution under any absorbing discrete-time Markov chain model of population genetics. This approach exploits sparsity, parallelism, and modern computational architectures [1], and is completely general with respect to the underlying model. It therefore requires none of the classical simplifying assumptions (e.g., weak selection, weak mutation, infinite sites, etc.).
For the purposes of the present study, we assumed a biallelic diploid Wright-Fisher model [5] including bidirectional mutation, selection and dominance. Computationally, our solution mainly relies on back-substitutions using an LU decomposition of a sparse matrix derived from the model’s transition matrix, and does not use any matrix-matrix multiplications, which can be expensive to implement. To the best of our knowledge, this is the first computationally feasible, exact approach for computing allele age (or its moments) to be proposed. Calculation of the expected value and variance of allele age is fast, exact and scales easily to realistic population sizes ($N_e \approx 10^5$). We have implemented this method in our software package Wright-Fisher Exact Solver, WFES [1] (available at https://github.com/dekoning-lab/wfes/).

3.2 Results

Using the approach outlined above and described fully in the Methods, we considered allele age and related quantities in a biallelic Wright-Fisher model including bidirectional mutation, selection, and dominance. For selection coefficient s and dominance coefficient h, the homozygous wildtype fitness was defined as 1, heterozygote fitness as 1 + sh, and homozygous mutant fitness as 1 + s (following [5]). Bi-directional mutation was modelled in the Wright-Fisher transition matrix [5], with extinction and fixation states artificially set to be absorbing. By allowing mutation, we assume that an arbitrary number of new mutations could potentially arise in the population while an initial mutant is segregating, and thus the assumption of shared ancestry of all segregating mutants is not necessarily made. As a result, allele age should be interpreted here in the population context as the expected age of the mutant state, starting from a population that was monomorphic for the wild-type state. It should be noted that, as a result, “age” may cease to be a meaningful concept when θ ≫ 1.

Except where otherwise specified, all results are for a rare allele observed in x = 10 copies, sampled from an effective population size of Ne = 10,000 diploids. Forward and backward mutation rates were assumed equal. We consider a range of population-scaled mutation rates, θ = 4Neµ, between θ = 0.0048 and θ = 0.96, where µ is the mutation rate per site per chromosome, and Ne the effective population size. Results obtained using values of θ that were two orders of magnitude smaller than θ = 0.0048 were largely similar (not shown).
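The parameterization described above can be illustrated with a minimal sketch of the transition matrix. It assumes the standard textbook construction: selection on genotype fitnesses 1, 1 + sh and 1 + s, followed by bidirectional mutation and binomial sampling of 2Ne gametes. The function names, the rate arguments `u` and `v`, and the ordering of selection before mutation are illustrative assumptions and are not taken from the WFES source.

```python
import math
import numpy as np

def wright_fisher_matrix(n_e, s, h, u, v):
    """Dense (2Ne+1) x (2Ne+1) transition matrix; entry [i, j] is the
    probability of moving from i to j mutant copies in one generation.

    Assumed (hypothetical) conventions: u is the wild-type -> mutant rate,
    v is the mutant -> wild-type rate, selection acts before mutation.
    Uses math.comb, so it is only suitable for small populations.
    """
    two_n = 2 * n_e
    P = np.zeros((two_n + 1, two_n + 1))
    for i in range(two_n + 1):
        q = i / two_n  # current mutant allele frequency
        # Mean fitness with genotype fitnesses 1 + s, 1 + s*h, 1
        w_bar = (1 + s) * q * q + (1 + s * h) * 2 * q * (1 - q) + (1 - q) ** 2
        # Mutant frequency after selection
        q_sel = ((1 + s) * q * q + (1 + s * h) * q * (1 - q)) / w_bar
        # Mutant frequency after bidirectional mutation
        psi = q_sel * (1 - v) + (1 - q_sel) * u
        for j in range(two_n + 1):
            P[i, j] = math.comb(two_n, j) * psi ** j * (1 - psi) ** (two_n - j)
    return P

def transient_block(P):
    """Q: transitions among transient states 1 .. 2Ne-1, treating
    extinction (0 copies) and fixation (2Ne copies) as absorbing."""
    return P[1:-1, 1:-1]

# Example: theta = 4 * Ne * mu with equal forward and backward rates
n_e, theta = 10, 0.96
mu = theta / (4 * n_e)
Q = transient_block(wright_fisher_matrix(n_e, s=0.0, h=0.5, u=mu, v=mu))
```

Index k of `Q` corresponds to k + 1 mutant copies. The dense double loop is only meant to make the definitions concrete; it does not exploit the sparsity that the method relies on.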

Table 3.1: Parameters used throughout this study and their meanings

Symbol   Value           Meaning
Ne       {10^3, 10^4}    Effective population size
µ        Varies          Mutation rate per generation per site
θ        Varies          Population-scaled mutation rate, 4Neµ
s        Varies          Selection coefficient
h        {0, 0.5, 1}     Dominance coefficient
p        Usually 1       Number of copies of mutant at origination
x        Usually 10      Observed number of mutant alleles

3.2.1 Validation by comparison to other methods

We first examined the correspondence between expected allele age determined by exact computation with the Wright-Fisher Markov model and the expected allele age approximated using Kimura and Ohta’s diffusion approach [31]. Since Kimura and Ohta’s method assumes no selection and no mutation, we ran our computations on a Wright-Fisher model having these same assumptions. Across a range of effective population sizes and observed allele counts, the methods exhibited close correspondence (Table 3.2).

Table 3.2: Expected neutral allele age determined by exact computation (this study) and by Kimura and Ohta’s [42] diffusion approximation. No selection or mutation were assumed in the underlying Wright-Fisher model to ensure that the assumptions of both methods were compatible.

Ne       x       Diffusion   Exact
1,000    10      106.5       103.73
5,000    10      138.29      134.99
10,000   10      152.09      148.56
20,000   10      165.92      162.16
50,000   10      184.23      180.16
1,000    100     630.68      628.65
5,000    100     930.34      927.8
10,000   100     1,064.99    1,062.22
20,000   100     1,201.30    1,198.30
50,000   100     1,382.93    1,379.63
1,000    1,000   2,772.59    2,771.02
5,000    1,000   5,116.86    5,115.03
10,000   1,000   6,306.80    6,304.77
20,000   1,000   7,566.93    7,564.69
50,000   1,000   9,303.37    9,300.85
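The diffusion column of Table 3.2 can be reproduced with a few lines of code. The sketch below assumes the well-known closed form for the mean age of a neutral allele at frequency y in a diploid population, −4Ne·[y/(1 − y)]·ln y generations, which is the result usually attributed to Kimura and Ohta; the formula is not written out in this chapter, and the function name is illustrative.

```python
import math

def neutral_age_diffusion(n_e, copies):
    """Diffusion approximation to the expected age (in generations) of a
    neutral allele observed in `copies` copies among 2*n_e chromosomes.

    Assumes the closed form -4*Ne * y/(1-y) * ln(y), with y = copies/(2*Ne).
    """
    y = copies / (2 * n_e)
    return -4 * n_e * y / (1 - y) * math.log(y)

# A few (Ne, x) combinations from Table 3.2
for n_e, copies in [(1_000, 10), (10_000, 10), (1_000, 100), (50_000, 1_000)]:
    print(n_e, copies, round(neutral_age_diffusion(n_e, copies), 2))
```

Under this assumption the values agree with the diffusion column to the precision printed in the table, which suggests that this is the expression that was tabulated.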

Table 3.3: Representative expected allele age and variance including selection, dominance and mutation determined by simulation and exact computation. A diploid population of Ne = 1,000 was assumed with p = 1 and x = 10.

                      Simulation               Exact
θ      2Nes   h       Mean       Std. Dev.     Mean       Std. Dev.
0.01   0      NA      106.67     391.64        106.39     389.57
0.05   0      NA      118.21     433.01        117.99     431.67
0.1    0      NA      135.17     493.47        134.91     491.67
0.5    0      NA      477.75     1,531.31      477.67     1,531.58
0.96   0      NA      3,315.99   7,775.71      3,320.94   7,791.84
0.01   -3     0.0     116.69     449.90        116.42     449.90
0.01   -3     0.5     97.04      319.34        96.86      317.74
0.01   -3     1.0     84.51      238.28        84.38      237.13
0.01   3      0.0     91.62      275.07        91.47      273.91
0.01   3      0.5     96.63      317.56        96.46      316.12
0.01   3      1.0     100.73     369.59        100.53     367.69
0.96   -3     0.0     4,746.56   9,021.43      4,742.61   9,011.70
0.96   -3     0.5     2,994.45   5,757.48      2,990.35   5,746.30
0.96   -3     1.0     1,996.64   3,813.41      1,994.75   3,808.97
0.96   3      0.0     729.07     2,090.27      728.74     2,088.35
0.96   3      0.5     773.22     2,759.72      773.03     2,758.71
0.96   3      1.0     933.51     4,004.62      932.89     4,004.79

We next validated our method and its implementation by comparing results to allele age simulations that included selection, dominance, and mutation (Table 3.3). Allele age probability distributions can be approximated by simulation by reversing the direction of time in a Wright-Fisher model that is modified to have the same stationary distribution as the original (forward-time) transition matrix [43]. “Forward time” simulations of this reversed model can then be performed starting at the observed frequency, x/(2Ne), and running until the beginning of the sample path (p/(2Ne); see Methods for details). Simulations performed in this manner agreed well with the model-based computations across the entire parameter range. Allele-frequency probability distributions approximated by simulation are shown for a subset of cases in Figure 3.1.

Figure 3.1: Representative neutral allele age probability distributions determined by simulation. Simulations for Ne = 1,000 were performed 10 million times with s = 0. A: θ = 0.01. B: θ = 0.96. Intermediate values of θ are shown in Figures S2 and S3. As the allele age distributions have very long tails, the undisplayed portion of the tail is accumulated in the final bin. [Figure not reproduced: histograms of frequency against allele age (generations), comparing the simulation estimate (n = 10M) with the exact computation.]

3.2.2 Computational advantages of the exact approach

Allele age simulations were implemented in C++ and parallelized, so that their runtimes would be reasonably fast (see https://github.com/dekoning-lab/allele_age_simulator/). Simulations were much more time consuming than the direct computation of the moments using our approach (e.g., 15 minutes versus 0.6 seconds for θ = 0.01; Table 3.4). As θ was increased, the simulations took increasingly more time both because the allele trajectories grew longer on average and because higher mutation rates also increased the variance in the duration of allele age trajectories. For θ = 0.96, running a 10 million replicate simulation over 32 cores took approximately 13 hours. On the other hand, the runtime for the exact matrix method was constant across different mutation rates and averaged about 0.5 seconds.

Table 3.4: Representative run-times (wall clock) for parallel computation of neutral allele age.

θ      Simulation∗ time (sec.)   Exact† time (sec.)
0.01   911.12                    0.58
0.05   1,021.56                  0.61
0.1    1,190.46                  0.60
0.5    5,143.14                  0.44
0.96   46,723.40                 0.64

3.2.3 Direct demonstration of classical results

Several classical results pertaining to allele age can be directly obtained by examining expected allele age and variance as a function of selection (Figure 3.2). It should be emphasized that these plots are neither probability distributions nor estimates. Rather, they are the exact moments of allele age derived directly from the Wright-Fisher model, as explained in the Methods section. For rare alleles, the expected allele age has a large variance relative to the mean and the mean age is roughly symmetric with respect to the sign of the selection coefficient, with neutral alleles expected to be older than selected alleles (Figure 3.2 B, leftmost column; the “Maruyama effect” [24]). The symmetry of allele age with respect to the direction of selection is among the most conspicuous classical findings on allele age, and has been the subject of recent study, where different authors have both supported it using population genomic data [30] and argued against it using simulations that included linkage [44].

Figure 3.2: Expected allele age and variance as a function of selection, dominance, and mutation rate. All calculations were made for a rare allele (x = 10) assuming Ne = 10,000 diploids. When heterozygote fitness is non-additive, weakly selected alleles are expected to be older than neutral alleles observed at the same frequency (A and C: left). When mutation is weak and heterozygote fitness is additive, allele age is symmetric with respect to the direction of selection (B: left). When the mutation rate increases, an age imbalance with respect to the direction of selection appears (left to right). Full results over a larger grid of θ can be found in Fig S1. [Figure not reproduced: rows A (h = 0.0), B (h = 0.5) and C (h = 1.0); columns θ = 4Neµ = 0.0048, 0.1109, 0.5355 and 0.9600; x-axis: Selection (2Nes); y-axis: Expected Allele Age.]

3.2.4 Selective strolls and stochastic slowdowns

Recent work [32, 33, 34] has convincingly demonstrated, using primarily simulation and diffusion theory methods, that weakly selected alleles are sometimes expected to be older than neutral alleles observed at the same frequency when fitness in heterozygotes is non- additive. This idea was termed “selective strolls” by Mafessoni et al. [34], referring to the observation that selected variants may sometimes persist in a population slightly longer than neutral ones. Here we directly reproduce this effect for rare recessive alleles (h = 0), where it can be seen that weakly deleterious alleles are expected to be older than neutral alleles at the same frequency (Figure 3.2 A, leftmost column), and for dominant alleles (h = 1), where it can be seen that weakly advantageous alleles are expected to be older than neutral alleles at the same frequency (Figure 3.2 C, leftmost column). Consistent with the findings of Mafessoni et al. [34], it is apparent that the selective stroll effect size is not very large and is on the order of about 5%.

3.2.5 Fast mutation and age imbalance

Contrary to the Maruyama effect, for mutation rates approaching θ ≈ 1 the mean allele age becomes strongly asymmetric around s = 0 (Figure 3.2, cf. left to right) such that weakly to moderately deleterious alleles can on average be substantially older than advantageous alleles at the same frequency. We refer to this previously unobserved phenomenon as “age imbalance”. Under age imbalance, slightly deleterious alleles are also expected to generally be older than neutral alleles at the same frequency. This new example of stochastic slowdown is observed even when heterozygote fitness is additive (i.e., with h = 0.5). The effect size in this case is substantially larger than for the previously noted slowdowns with small θ (or θ = 0; [34]). For example, expected extinction times for the oldest alleles with h = 0.5 are approximately 22.7% longer than for neutral alleles.

Rare recessive alleles (h = 0) under fast recurrent mutation (Figure 3.2, right) experience the same effect but to an even greater degree. Recessivity and fast mutation appear to have a similar and mutually reinforcing effect on both age imbalance and the stochastic slowdown under weak selection. Both selective stroll and age imbalance results appear to be explained primarily by the average time to extinction (Figure 3.3, left), which indicates that when mutation rates are bidirectionally fast, weakly deleterious alleles counter-intuitively take longer to go extinct than do advantageous alleles. For h = 0 extinction times are even longer for deleterious recessive alleles than for those with h = 0.5, but now the expected fixation times also show a similar imbalance with respect to the direction of selection (Figure 3.3, cf. A and B), which accentuates the stochastic slowdown further. Remarkably, the expected time to extinction for the oldest, weakly selected recessive alleles is about 66.9% longer than for neutral alleles (Figure 3.2 A, left). The same results for h = 1 are shown in Figure 3.3 C, where fixation times are shifted to the right rather than the left, which seems to largely cancel out the stochastic slowdown caused by the left-shifted extinction times.

Figure 3.3: Expected extinction and fixation times when mutation is strong (θ = 0.96) calculated by exact computation with the Wright-Fisher Markov model [1]. Parameter values used were the same as in Figure 3.2. Note the strong asymmetry with respect to the direction of selection, contrary to the classical result of Maruyama and Kimura [25]. [Figure not reproduced: rows A (h = 0.0), B (h = 0.5) and C (h = 1.0); left column: Extinction Time; right column: Fixation Time; x-axis: Selection (2Ns).]

To help explain Figure 3.3, we also calculated the conditional sojourn times for mutants that go to extinction, and compared these to sojourn times for neutral variants (Figure 3.4). For deleterious alleles, we see that the time spent at low frequencies increases as we move away from 2Nes = 0 until 2Nes = −2.53 is reached; the stronger selection is within this parameter range, the more extinction sojourns are dominated by residency at lower frequencies compared to neutral. While this trend is expected since negative selection opposes increases in allele frequency, it is surprising that the net change in non-neutral sojourn times is positive. That is, the increased time at low frequencies surpasses the decreased time spent at high frequencies, resulting in longer sojourns overall. This phenomenon has also been reported for previously noted stochastic slowdowns [33].

3.2.6 Allowing the starting number of copies to vary

Figure 3.4: Difference in conditional sojourn times (compared to neutral) for selected alleles going to extinction. Curves span approximately 2Nes = {−3, 3}; h = 0, θ = 0.96. Top: increasing selection against the mutant allele up to the critical point 2Nes = −2.53 counter-intuitively increases sojourn times by prolonging residency in low frequency classes. Bottom: increasing selection favouring the mutant allele decreases the length of extinction sojourns. Bold: 2Nes = ±2.53. The maximum of the extinction time curve in Figure 3.3 A is at −2.53.

When population-scaled mutation rates are very high it can become plausible that an originating mutation enters the population in several copies (i.e., that it simultaneously occurs in several individuals). For example, when θ = 0.96, the average number of mutations entering the population per generation is 0.96/2 = 0.48, so on average there will be a new mutation every two generations. The probability of a population generating at least one new copy of the mutant allele in a single generation, assuming mutations are Poisson distributed, is 1 − e^(−0.48) ≈ 0.38, and the probability of generating two or more copies is ≈ 0.08. This may pose problems for any method for calculating allele age, since when the likelihood of a population simultaneously generating more than one mutant becomes non-negligible, the starting number of copies, p, should be integrated out.

To integrate over p we consider the probability of starting in p copies, given that p ∼ Poisson(λ = θ/2). This can be easily implemented in our computational procedure starting at equation (3.4.8) by reusing the LU decomposition of (I − Q)^T, which does not depend on p (see Methods). Since this decomposition is by far the most computationally expensive operation, the integrated solution is only trivially harder than when assuming a single starting copy. In addition, since the probability of large numbers of mutations occurring in the same generation will typically be negligible, we define a threshold ε such that only starting configurations with a probability greater than ε are considered. Below, we assumed ε = 10^−5.

In Figure 3.5 we show the effect of numerically integrating over p when θ = 0.96 for the range of mutation rates, selection coefficients, and dominance coefficients considered throughout the manuscript. In most cases, the results were identical at better than three to four decimal places, and only began to diverge slightly when θ was very large (i.e., θ = 0.96). It is possible that other statistics of the Markov process might change more than this as a function of p, and thus to be conservative one may choose to always integrate over p (particularly since this adds only seconds to the compute time). However, we conclude that assuming p = 1 (as is done by convention in all previous studies of allele age that we are aware of) is likely to introduce no bias unless θ is quite large (i.e., θ ≫ 1).
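The truncated Poisson weighting over the starting copy number p can be sketched as follows. The rate λ = θ/2 and the threshold of 10^−5 come from the text; the function names, the restriction to p ≥ 1, and the renormalization of the retained weights are illustrative assumptions, and the way the weights are combined with the per-p results is not shown here.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def start_copy_weights(theta, eps=1e-5):
    """Weights for the starting copy number p >= 1, with p ~ Poisson(theta/2).

    Only values of p whose Poisson probability exceeds eps are kept.
    Renormalizing the kept weights to sum to 1 is an assumption of this sketch.
    """
    lam = theta / 2.0
    weights = {}
    p = 1
    while True:
        prob = poisson_pmf(p, lam)
        if prob > eps:
            weights[p] = prob
        elif p > lam:
            break  # past the mode and below the threshold: stop
        p += 1
    if not weights:
        return {1: 1.0}  # mutation so rare that a single copy is the only case
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

lam = 0.96 / 2.0
p_at_least_one = 1.0 - poisson_pmf(0, lam)
p_two_or_more = p_at_least_one - poisson_pmf(1, lam)
weights = start_copy_weights(0.96)
```

Because the LU factorization does not depend on p, each retained value of p would cost one extra set of back-substitutions rather than a new factorization.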

Figure 3.5: Effect of integrating out uncertainty in p. The integral (summation) was taken to a finite number of terms such that all values of p with probability ≥ 10^−5 were considered. Points represent all parameter combinations considered in Figure 3.2.

3.3 Discussion

Computational population genetics approaches offer the relatively straightforward ability to explore parameter ranges or assumptions that may be inaccessible to classical theory. Usually simulations are used to address scenarios where the assumptions of classical theory may be violated. However, simulations can often be slow, require long runtimes to obtain precise estimates for rare events, and can scale poorly to large populations. An alternative computational approach is to find a class of models whose properties can be interrogated directly, without the need for simulation. For example, Steinruecken et al. [45] recently showed how the transition density function of biallelic Wright-Fisher diffusions [46] could be approximately computed, eliminating the need for a variety of simulations (although allele age has not been considered in this framework). Here we have shown that even the exact computational analysis of biallelic Markov models (including Wright-Fisher models) can be made efficient enough to often eliminate the need for either simulations or diffusion approximations in the first place. Markov chain models are typically discounted early in the lifecycle of a population genetic investigation in favour of diffusion approximations, since they are widely viewed as impractical to work with due to their large and potentially unwieldy state spaces. Contrariwise, here and elsewhere [1], we have shown that judicious computation, sparsity, and parallelism can be together exploited to rather surprising effect, making exact computation under general Markov models not only tractable but capable of generating new insights with ease. Working directly with the underlying Markov models of population genetics has a number of advantages. For example, when strong mutation is included, absorbing boundaries can artificially become inaccessible in a diffusion. There is no corresponding problem when studying the unapproximated Markov chain. 
In addition, diffusion approaches cannot easily describe behaviours at the absorbing boundaries (but see [47]).

One of the most appealing aspects of this computational population genetics approach is that it is general with respect to underlying modelling assumptions, as long as they can be expressed as a finite absorbing Markov chain. This approach also has several advantages over simulations, including fast runtimes that are relatively insensitive to modelling assumptions (Table 3.4), and exact results (within machine precision) even for small effects or rare events that would otherwise require long-run, high replicate simulations to study. For a population size of Ne = 10,000, exact calculation of the expected allele age and variance, absorption probabilities and times, and conditional sojourn times, takes only about 6.5 seconds using 16 Intel E5-2670 cores (2.60 GHz) in our reference implementation [1]. Models with greater sparsity are even faster and can scale much better. For example, the same analysis under a comparable Moran model takes only about 0.25 seconds [1].

The method proposed for calculating allele age is based on the efficient computation of the moments of the probability distribution of allele ages. It is therefore appropriate to view these quantities not as estimates, but as exact results for a given model. An advantage of this approach is that the expected value of the allele age probability distribution will more often be much closer to the true allele age than would a maximum likelihood estimator, since the age distributions are both highly skewed and very long tailed (see Figure 3.1). A potential disadvantage is that we must assume that the true population frequency is known without error. In cases where it is not, error in the observed frequency could be accounted for by computing allele age for a range of population frequencies centred on the observed value.

As shown in Figure 3.2, classical allele age results [25, 24, 42] can be easily obtained for general population genetic models with our approach. We also reproduced exact representations of recently discovered effects, such as “selective strolls”, which have a smaller effect on expected allele age when mutation rates are low (also see [34]).
By exploiting the generality of our approach, we provided new evidence for a stochastic slowdown that occurs when bidirectional mutation is fast, such that rare, weakly deleterious alleles are expected to be substantially older than neutral alleles. In the most extreme case, average extinction times for the oldest alleles were 22% and 68% longer than for neutral alleles (for h = 0.5 and h = 0, respectively). Finally, we found that when relaxing the assumption of weak mutation, a large age imbalance arises with respect to selection, such that rare deleterious alleles are expected to be old and rare advantageous alleles very young. This may be explained in part by the expectation that with strong mutation pressure and positive selection, allele frequencies will rise rapidly following origination. When this is true, the best explanation for a beneficial allele being rare is that it only arose quite recently. This expected rapid rise in mutant frequency under strong mutation and positive selection may also be responsible for the much faster extinction times for beneficial alleles compared to deleterious ones (Figure 3.3: left), since the longer beneficial alleles persist, the more likely their frequencies are to be pushed upwards towards fixation. Consequently, the mutants that go to extinction are most likely to do so quickly.

A potential limitation of our approach to calculating allele age is that we have assumed equilibrium demography with constant population size. However, this is a limitation of our implementation rather than of the method itself. One solution to this problem is to consider instantaneous switches among different population sizes under a Markov-modulated model. By virtue of our sparse linear algebra approach, this would only be linearly more difficult than the constant population size approach. It could also have advantages over existing diffusion theory methods [48], for example, by faithfully modelling an increase in the population mutation rate during population growth that includes the effect of recurrent mutation. Such considerations may be important for understanding adaptation in organisms with “boom and bust” population dynamics [40]. We leave exploration of these ideas for future work.

3.4 Materials and Methods

3.4.1 Theory

Let X(t) be an absorbing discrete-time Markov chain with known transition matrix P and state-space defined by the number of copies of a mutant allele in a population of Ne effective diploid individuals. Let Q be the submatrix of P that contains only transient-to-transient state transitions. Assume that the current number of mutant alleles x is a transient state, so the allele in question is neither extinct nor fixed. We also assume that the allele entered the population at a specific frequency p/(2Ne), where p is a transient state (we later show how this assumption can be relaxed). In practice, we consider p = 1 unless stated otherwise.

The probability of transitioning from state p to state x in time t is simply $P^t_{p,x}$, or equivalently $Q^t_{p,x}$ since both p and x are transient states. Since the Markov chain is absorbing,

$$\sum_{t=0}^{\infty} Q^t_{p,x} = (I - Q)^{-1}_{p,x} \tag{3.4.1}$$

is finite [8], where I is the identity matrix. This finiteness allows us to fix x and p and specify a probability distribution of the allele age.

$$f_{p,x}(t) = \frac{Q^t_{p,x}}{\sum_{t=0}^{\infty} Q^t_{p,x}} = \frac{Q^t_{p,x}}{(I - Q)^{-1}_{p,x}} \tag{3.4.2}$$

A complete measure theoretic construction of this distribution can be found in the supplementary material 3.7. The exact moments of this distribution can be written in terms of the matrix Q by using matrix sum identities. We show the first three below using $[A]_{b,c}$ to denote the entry in the b-th row and c-th column of matrix A.

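$$\mu_1 = \sum_{t=0}^{\infty} t f_{p,x}(t) = \frac{\sum_{t=0}^{\infty} t Q^t_{p,x}}{\sum_{t=0}^{\infty} Q^t_{p,x}} = \frac{[Q(I - Q)^{-2}]_{p,x}}{[(I - Q)^{-1}]_{p,x}} \tag{3.4.3}$$

$$\mu_2 = \sum_{t=0}^{\infty} t^2 f_{p,x}(t) = \frac{\sum_{t=0}^{\infty} t^2 Q^t_{p,x}}{\sum_{t=0}^{\infty} Q^t_{p,x}} = \frac{[Q(I + Q)(I - Q)^{-3}]_{p,x}}{[(I - Q)^{-1}]_{p,x}} \tag{3.4.4}$$

$$\mu_3 = \sum_{t=0}^{\infty} t^3 f_{p,x}(t) = \frac{\sum_{t=0}^{\infty} t^3 Q^t_{p,x}}{\sum_{t=0}^{\infty} Q^t_{p,x}} = \frac{[Q(Q^2 + 4Q + I)(I - Q)^{-4}]_{p,x}}{[(I - Q)^{-1}]_{p,x}} \tag{3.4.5}$$

The expected allele age is given by $\mu_1$, and the variance is given by $\mu_2 - \mu_1^2$.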
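As a sanity check on Equations (3.4.3) and (3.4.4), the closed forms can be compared against a direct truncated summation over t on a small example. The sketch below uses dense NumPy algebra and an arbitrary toy substochastic matrix rather than a Wright-Fisher model; the function names and the toy matrix are illustrative assumptions, and indices are 0-based positions within Q.

```python
import numpy as np

def age_moments_closed_form(Q, p, x):
    """Mean and variance of allele age from the closed-form expressions."""
    n = Q.shape[0]
    I = np.eye(n)
    N = np.linalg.inv(I - Q)                 # (I - Q)^{-1}
    denom = N[p, x]
    mu1 = (Q @ N @ N)[p, x] / denom          # [Q (I-Q)^{-2}]_{p,x}
    mu2 = (Q @ (I + Q) @ N @ N @ N)[p, x] / denom
    return mu1, mu2 - mu1 ** 2

def age_moments_series(Q, p, x, t_max=2000):
    """Same moments by summing t^k * Q^t_{p,x} for t = 0 .. t_max."""
    row = np.zeros(Q.shape[0])
    row[p] = 1.0                             # row p of Q^0
    s0 = s1 = s2 = 0.0
    for t in range(t_max + 1):
        s0 += row[x]
        s1 += t * row[x]
        s2 += t * t * row[x]
        row = row @ Q                        # row p of Q^(t+1)
    mu1 = s1 / s0
    return mu1, s2 / s0 - mu1 ** 2

def toy_transient_matrix(n=6, leak=0.1, seed=0):
    """Random substochastic matrix whose rows sum to 1 - leak (toy data)."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, n))
    return (1.0 - leak) * A / A.sum(axis=1, keepdims=True)

Q_toy = toy_transient_matrix()
closed = age_moments_closed_form(Q_toy, p=0, x=3)
series = age_moments_series(Q_toy, p=0, x=3)
```

The explicit inverse is used here only because the example is tiny; it is exactly the operation that the back-substitution scheme avoids for realistic population sizes.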

It is interesting, and relevant if the reader wishes to compute higher moments than those listed above, to notice that the k-th moment $\mu_k$ is closely linked to the matrix polylogarithm function $\mathrm{Li}_{-k}(Q)$ by the following equation.

$$\mu_k = \frac{[\mathrm{Li}_{-k}(Q)]_{p,x}}{[(I - Q)^{-1}]_{p,x}} \tag{3.4.6}$$

where

$$\mathrm{Li}_{-s}(z) = \sum_{k=1}^{\infty} z^k k^s = \left(z \frac{\partial}{\partial z}\right)^{s} \left(z(1 - z)^{-1}\right) \tag{3.4.7}$$

Combining equations (3.4.6) and (3.4.7) therefore allows for the rapid symbolic computation of the closed-form expressions for any moment $\mu_k$.
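As a worked illustration, repeatedly applying the operator $z\,\partial/\partial z$ to $z(1-z)^{-1}$ gives the scalar closed forms below. The first three match the numerators of Equations (3.4.3) to (3.4.5) after substituting $z \to Q$ and $1 \to I$; the fourth is included only as an example of a higher moment and is not stated in the text.

```latex
\begin{align*}
\mathrm{Li}_{-1}(z) &= z\,\frac{\partial}{\partial z}\,\frac{z}{1-z}
                     = \frac{z}{(1-z)^{2}} \\
\mathrm{Li}_{-2}(z) &= z\,\frac{\partial}{\partial z}\,\frac{z}{(1-z)^{2}}
                     = \frac{z(1+z)}{(1-z)^{3}} \\
\mathrm{Li}_{-3}(z) &= z\,\frac{\partial}{\partial z}\,\frac{z(1+z)}{(1-z)^{3}}
                     = \frac{z(z^{2}+4z+1)}{(1-z)^{4}} \\
\mathrm{Li}_{-4}(z) &= z\,\frac{\partial}{\partial z}\,\frac{z(z^{2}+4z+1)}{(1-z)^{4}}
                     = \frac{z(1+z)(z^{2}+10z+1)}{(1-z)^{5}}
\end{align*}
```

Under the same substitution, the fourth line would correspond to a numerator of $[Q(I+Q)(Q^{2}+10Q+I)(I-Q)^{-5}]_{p,x}$ for $\mu_4$.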

3.4.2 Implementation

Computation of the moments in Equations (3.4.3), (3.4.4) and (3.4.5) can be greatly simplified. This simplification requires obtaining a single LU decomposition of a sparse matrix and using it to solve multiple linear systems by back-substitution.

The first step is to calculate the LU decomposition of (I − Q)^T, where T denotes transpose. LU decomposition has a theoretical time complexity on the same order as matrix multiplication, and thus can be as large as O(n^3) floating point operations for a dense n × n matrix. However, much faster solutions are possible for sparse matrices, which scale in terms of the number of non-zero entries (e.g., [19, 49]). For Wright-Fisher models, Q and hence (I − Q)^T, are typically very sparse (at machine precision), and thus a potentially large time savings can be obtained by exploiting this sparsity. Computation of the LU decomposition is by far the most time-intensive step, but we find it is still feasible for population sizes around 10^5 on typical workstation computers as of the time of writing [1].

The second step is to use forward and back substitution to solve multiple linear systems. Given the LU decomposition, this is quite fast and typically requires only a few seconds.

First we solve for $M_1$ in

$$(I - Q)^T M_1 = e_p \tag{3.4.8}$$

where $e_p$ is the p-th column of the identity matrix. Note that $M_1^T$ is the p-th row of $(I - Q)^{-1}$, so that the x-th entry of $M_1$ is in fact $(I - Q)^{-1}_{p,x}$ as required in the denominator of Equations (3.4.3) and (3.4.4).

Next, we use the same LU decomposition to solve for $M_2$ in

$$(I - Q)^T M_2 = M_1 \tag{3.4.9}$$

Notice that

$$((I - Q)^2)^T M_2 = (I - Q)^T (I - Q)^T M_2 = (I - Q)^T M_1 = e_p \tag{3.4.10}$$

so that $M_2^T$ is actually the p-th row of $(I - Q)^{-2}$. We next take the dot product of $M_2$ with the x-th column of Q, which we call $Q_x$.

$$M_2^T \cdot Q_x = [(I - Q)^{-2} Q]_{p,x} = [Q(I - Q)^{-2}]_{p,x} \tag{3.4.11}$$

which is what was required in the numerator of Equation (3.4.3).

We repeat the procedure and solve for $M_3$ in

$$(I - Q)^T M_3 = M_2 \tag{3.4.12}$$

Again, we have

$$((I - Q)^3)^T M_3 = (I - Q)^T (I - Q)^T M_2 = (I - Q)^T M_1 = e_p \tag{3.4.13}$$

so that $M_3^T$ is the p-th row of $(I - Q)^{-3}$. In order to compute the numerator of the second moment, we also need the x-th column of $Q(I + Q)$, which we call $A_x$. Note this does not in any way necessitate a full matrix multiplication, as we require only the x-th column. Although this is potentially an expensive $O(n^2)$ computation, in practice, sparsity makes it trivially easy. Now we have

$$M_3^T \cdot A_x = [(I - Q)^{-3} Q(I + Q)]_{p,x} = [Q(I + Q)(I - Q)^{-3}]_{p,x} \tag{3.4.14}$$

as required in the numerator of Equation (3.4.4). Hence we have calculated all necessary components of the expected value and variance as given in Equations (3.4.3) and (3.4.4).

The computation of higher moments can be easily implemented as well. To do this, one would first use equations (3.4.6) and (3.4.7) to obtain closed-form expressions for the needed moments. We recommend using a factored form of the expression so that matrix multiplication is never required in the implementation (it is a convenient property of the polylogarithm that all closed-form expressions of $\mathrm{Li}_{-s}(z)$ factor completely over the reals). The implementation would then require extending the above algorithm as needed, i.e. iteratively solving

$$(I - Q)^T M_{k+1} = M_k \tag{3.4.15}$$

for $M_{k+1}$, where $M_k^T$ is the p-th row of $(I - Q)^{-k}$.

We have implemented this approach for the first two moments in our software package Wright-Fisher Exact Solver, WFES [1] (available at https://github.com/dekoning-lab/wfes/). In practice it takes only seconds to minutes to calculate the relevant quantities for population sizes under Ne = 100,000.

As an aside, we note that the full probability distribution can also be feasibly approximated for small Ne to an arbitrary degree of precision by taking the summation in equation (3.4.2) to some large finite value.
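The back-substitution scheme of Equations (3.4.8) to (3.4.14) can be sketched with SciPy's sparse LU factorization. This is a minimal illustration under stated assumptions, not the WFES implementation: `scipy.sparse.linalg.splu` stands in for whatever sparse solver the reference software uses, the function names are hypothetical, the toy matrix is a simple banded chain rather than a Wright-Fisher model, and p and x are 0-based indices into Q.

```python
import numpy as np
from scipy.sparse import csc_matrix, diags, identity
from scipy.sparse.linalg import splu

def allele_age_mean_var(Q, p, x):
    """Mean and variance of allele age from one sparse LU factorization."""
    Q = csc_matrix(Q)
    n = Q.shape[0]
    # Factor (I - Q)^T once; every later step reuses this factorization.
    lu = splu(csc_matrix((identity(n, format="csc") - Q).T))

    e_p = np.zeros(n)
    e_p[p] = 1.0
    M1 = lu.solve(e_p)   # Eq. (3.4.8): p-th row of (I - Q)^{-1}
    M2 = lu.solve(M1)    # Eq. (3.4.9): p-th row of (I - Q)^{-2}
    M3 = lu.solve(M2)    # Eq. (3.4.12): p-th row of (I - Q)^{-3}

    Qx = Q[:, [x]].toarray().ravel()   # x-th column of Q
    Ax = Qx + Q @ Qx                   # x-th column of Q(I + Q) = Q + Q^2

    denom = M1[x]                      # [(I - Q)^{-1}]_{p,x}
    mu1 = (M2 @ Qx) / denom            # Eq. (3.4.3)
    mu2 = (M3 @ Ax) / denom            # Eq. (3.4.4)
    return mu1, mu2 - mu1 ** 2

def toy_banded_chain(n=50):
    """Toy tridiagonal transient matrix; probability mass leaks at both ends."""
    return diags([0.3, 0.4, 0.3], [-1, 0, 1], shape=(n, n), format="csc")

Q_band = toy_banded_chain()
mean_age, var_age = allele_age_mean_var(Q_band, p=0, x=9)
```

Only matrix-vector products and triangular solves appear after the factorization, which mirrors the point made above that no matrix-matrix multiplication is needed.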

3.5 Simulations

In order to simulate a distribution of allele ages, we must reverse the process, i.e. use the reversed absorbing Markov chain. Specifically, the simulation will start at state x and essentially run backwards in time until it hits state p. It will then either keep going, or stop with a probability equal to the probability that the current visit to state p is the beginning of the chain (when the mutation first entered the population). This backwards simulation can be done by creating a reversed transition matrix and running it in a forwards simulation. We use the method presented in Chae et al. [43], which is as follows. The states of the reversed absorbing Markov chain are {1, 2, ..., 2Ne − 2, 2Ne − 1, stop}, where the stop state is absorbing and all others are transient. The reversed Markov chain does not regard fixation

53 or extinction as absorbing states, and in fact does not allow transition to these states at all. Let P 0 be the matrix of transition probabilities of the reversed absorbing Markov chain. In its canonical form,

  Q0 R0   P =   0 I

We have

0 Qk,jNp,k Qj,k = Np,j  N −1 if j = p, i = stop 0  p,p Rj,i =  0 otherwise

where Q and N are the transient-to-transient state transition matrix and the fundamental matrix, respectively, of the original Markov chain. (Note that N here is used by convention to represent the fundamental matrix and has no relationship to Ne defined above.)
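The reversed chain can be checked numerically on a small example. The sketch below is illustrative only: it assumes a neutral, mutation-free chain with 2Ne = 10, and the function names are hypothetical. It builds Q' and R' from Q and N and confirms that every row of the reversed chain sums to one.

```python
import math
import numpy as np

def wf_transient_block(two_n):
    """Transient block Q (allele counts 1..two_n-1) of a neutral Wright-Fisher chain.
    Assumption: psi_i = i / two_n, i.e. no selection or mutation."""
    P = np.array([[math.comb(two_n, j) * (i / two_n) ** j * (1 - i / two_n) ** (two_n - j)
                   for j in range(two_n + 1)]
                  for i in range(two_n + 1)])
    return P[1:two_n, 1:two_n]

def reversed_chain(Q, p):
    """Return (Q_rev, R_rev) for the chain reversed with respect to starting index p."""
    N = np.linalg.inv(np.eye(Q.shape[0]) - Q)   # fundamental matrix of the forward chain
    Np = N[p, :]                                # expected visits to each state from p
    Q_rev = Q.T * Np[None, :] / Np[:, None]     # Q'_{j,k} = Q_{k,j} N_{p,k} / N_{p,j}
    R_rev = np.zeros(Q.shape[0])
    R_rev[p] = 1.0 / N[p, p]                    # only state p can move to `stop`
    return Q_rev, R_rev

Q = wf_transient_block(10)
Q_rev, R_rev = reversed_chain(Q, p=0)           # index 0 is allele count 1
row_sums = Q_rev.sum(axis=1) + R_rev
print(np.allclose(row_sums, 1.0))
```

The row sums follow from NQ = N − I: every row of Q' sums to one except row p, which is short by exactly 1/N_{p,p}, the probability of stopping.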

3.6 Supporting information

3.7 Supplementary methods

The purpose of this appendix is to present a measure theoretic construction of the distribution of allele age, for a fixed p and x, where p and x are transient states of the Markov Chain. This is necessary to prove the correctness of the method presented. Let the sample space Ω be the set containing all finite realizations of the Markov Chain which begin with p and end with x. We write an arbitrary element of Ω as the ordered tuple

Figure 3.6: Expected allele age and variance as a function of selection, dominance, and mutation. Larger range of mutation rates. [Plot not reproduced. Columns: θ = 4Neμ ∈ {0.9600, 0.8539, 0.7477, 0.6416, 0.5355, 0.4293, 0.3232, 0.2171, 0.1109, 0.0048}; rows: A. h = 0.0, B. h = 0.5, C. h = 1.0; x-axis: Selection (2Ns); y-axis: Expected Allele Age.]

Figure 3.7: Simulated neutral allele age distributions. Larger range of mutation rates. [Plot not reproduced. Panels A-E: θ = 4Neμ = 0.01, 0.05, 0.10, 0.50, 0.96; series: simulation estimate (n = 10M) and exact computation; x-axis: Allele age (generations); y-axis: Frequency.]

Figure 3.8: Simulated non-neutral allele age distributions (θ = 0.01). Larger range of mutation rates. A-C: 2Nes = −3. D-F: 2Nes = 3. [Plot not reproduced. Columns: h = 0.0, 0.5, 1.0; series: simulation estimate (n = 10M) and exact computation; x-axis: Allele age (generations); y-axis: Frequency.]

Figure 3.9: Simulated non-neutral allele age distributions (θ = 0.96). Larger range of mutation rates. A-C: 2Nes = −3. D-F: 2Nes = 3. [Plot not reproduced. Columns: h = 0.0, 0.5, 1.0; series: simulation estimate (n = 10M) and exact computation; x-axis: Allele age (generations); y-axis: Frequency.]

$$\omega = (p, a_1, a_2, \ldots, a_n, x) \in \Omega \tag{3.7.1}$$

where $n \in \{0, 1, 2, \ldots\}$ is arbitrary and finite and the $a_i$ are transient states in the Markov Chain for $1 \le i \le n$. It is worth noting that if p = x, then (p) ∈ Ω as well. The set Ω is countably infinite because for each fixed length n, there are only finitely many realizations of the Markov Chain. Define $F = 2^{\Omega}$, the set of all subsets of Ω, so that F is trivially a σ-algebra satisfying the necessary conditions (that is, Ω ∈ F and F is closed under complement and countable union). To define a probability measure P : F → [0, 1], we first define P on each singleton. Define P({(p)}) = 1/c if p = x, and for all other ω ∈ Ω,

$$P(\{\omega\}) = P((p, a_1, a_2, \ldots, a_n, x)) = \frac{1}{c}\, Q_{p,a_1} Q_{a_1,a_2} \cdots Q_{a_n,x} \tag{3.7.2}$$

where c is a constant that will allow P (Ω) = 1. Since every element of F other than the empty set is a disjoint union of singletons, extending P beyond singleton sets to the rest of F is then just a matter of applying countable additivity and allowing P (∅) = 0.

Let us find the constant c. Define a function $l : \Omega \to \mathbb{R}$ such that l(ω) gives the number of transitions that occurred in that specific realization, for example l((p, x)) = 1. Let $\Omega_m \subset \Omega$ be the subset containing all elements with m transitions, that is,

$$\Omega_m = \{\omega \in \Omega \mid l(\omega) = m\} \tag{3.7.3}$$

For a fixed m, we have

$$P(\Omega_m) = \sum_{\omega \in \Omega_m} P(\{\omega\}) = \frac{1}{c} \left(Q^m\right)_{p,x} \tag{3.7.4}$$

by the Chapman-Kolmogorov equation, where the entry is taken after the matrix power. Therefore,

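$$1 = P(\Omega) = P\left(\bigcup_{m=0}^{\infty} \Omega_m\right) = \sum_{m=0}^{\infty} P(\Omega_m) = \frac{1}{c} \sum_{m=0}^{\infty} \left(Q^m\right)_{p,x} \tag{3.7.5}$$

$$= \frac{1}{c} \left[(I - Q)^{-1}\right]_{p,x} \tag{3.7.6}$$

where I is the identity matrix, so that $c = \left[(I - Q)^{-1}\right]_{p,x}$. Thus we have fully defined the probability space. It is now easy to define the random variable $Y : \Omega \to \mathbb{R}$ as Y(ω) = l(ω), with probability mass function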

$$f_Y(y) = \begin{cases} \dfrac{\left(Q^y\right)_{p,x}}{\left[(I - Q)^{-1}\right]_{p,x}} & \text{if } y \in \{0, 1, 2, 3, \ldots\} \\ 0 & \text{otherwise} \end{cases} \tag{3.7.7}$$
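The probability mass function (3.7.7) can be evaluated directly for a small chain. The sketch below is illustrative: it assumes a neutral, mutation-free chain with 2Ne = 10, truncates the support at a hypothetical `y_max`, and uses hypothetical function names.

```python
import math
import numpy as np

def wf_transient_block(two_n):
    """Transient block Q (allele counts 1..two_n-1) of a neutral Wright-Fisher chain.
    Assumption: psi_i = i / two_n, i.e. no selection or mutation."""
    P = np.array([[math.comb(two_n, j) * (i / two_n) ** j * (1 - i / two_n) ** (two_n - j)
                   for j in range(two_n + 1)]
                  for i in range(two_n + 1)])
    return P[1:two_n, 1:two_n]

def allele_age_pmf(Q, p, x, y_max):
    """f_Y(y) = (Q^y)_{p,x} / [(I - Q)^{-1}]_{p,x} for y = 0..y_max."""
    n = Q.shape[0]
    c = np.linalg.inv(np.eye(n) - Q)[p, x]   # normalizing constant c
    row = np.zeros(n)
    row[p] = 1.0                             # row p of Q^0
    pmf = np.empty(y_max + 1)
    for y in range(y_max + 1):
        pmf[y] = row[x] / c
        row = row @ Q                        # row p of Q^(y+1)
    return pmf

Q = wf_transient_block(10)
pmf = allele_age_pmf(Q, p=0, x=4, y_max=600)  # from 1 copy to 5 copies
mean_age = float(np.dot(np.arange(pmf.size), pmf))
```

Only vector-matrix products are used, so the cost per additional value of y is a single product with Q. The truncated mass function should sum to one up to the neglected tail, which decays geometrically.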

Chapter 4

A general Markov-modulated Wright-Fisher model of time-heterogeneous evolution

Contributions

BS, IK, and JdK developed the methods and models, IK implemented the methods, IK and JdK analyzed the data and wrote the paper. This chapter is under preparation for submission to Genetics or MBE.

4.1 Abstract

An important limitation of the Wright-Fisher model is that it assumes constant population size and no variation over time in parameters such as selection coefficients. Since such time-heterogeneity is both commonly encountered and important in real populations, a great deal of work has been done to relax these assumptions. Several of the most widely-used such approaches are based on diffusion approximations to the Wright-Fisher model. However, these diffusion approaches can suffer from a variety of limitations, not unlike those discussed in previous chapters. In addition, coalescent approaches have been proposed but have proven analytically challenging. Here, I describe a novel and versatile extension to the discrete-time Wright-Fisher Markov model, which is very straightforward to analyze: the Markov-modulated Wright-Fisher (MMWF) model. By embedding a set of Wright-Fisher processes into a parent stochastic process governing switches between model regimes, MMWF allows consideration of time-dependent parameters that are piece-wise constant over epochs (whose timing and duration may be deterministic or probabilistic).

Using this approach, I show that MMWF models can reveal novel insights about molecular evolution under time-heterogeneous population size. In particular, I consider changes in population size over time, including periodic changes, episodic expansions, and general piece-wise demographic histories. Using these models, I show a few new caveats on the use of the classical harmonic mean approximation to the effective population size. Finally, I propose a computationally efficient method for conveniently calculating the expected allele-frequency spectrum (AFS) at arbitrary times, which I compare to existing work.

4.2 Introduction

The Wright-Fisher (WF) model of population genetics [50, 51] is a classical description of how allele frequencies are expected to change in a population of finite and constant size over time. Much current theoretical and methodological effort in population genetics, genomics, and molecular evolution is based upon WF and derivative models. Largely for the purposes of mathematical tractability [52], WF assumes a constant population size of N reproducing individuals, which limits its direct applicability to many problems of biological interest. As a result, many authors have considered ways to build population genetic models that incorporate different forms of temporal heterogeneity. These studies have primarily focused on modelling changing population size, although some have also considered changes in selection and in general aspects of the model (e.g. Steinrucken et al. [45]).

A commonly used assumption in this work, which we also employ here, is the piece-wise constant population size. Although this lacks some of the generality of some diffusion approaches, which can more conveniently model continuous changes like exponential growth over many generations, several authors have nevertheless employed similar piece-wise constant models. For example, Steinrucken et al. [45] recently presented a very general piecewise constant diffusion approach based on an approximation to the transition-density function of a Wright-Fisher diffusion, SpectralTDF. Here, I show that a full Markov model treatment is substantially more straightforward, is computationally tractable when sparsity and other features are exploited, and appears more reliable than SpectralTDF. Our implementation is also found to be significantly faster than SpectralTDF when population sizes are only modestly large. A significant benefit of the MMWF framework is the ability to easily compute a variety of quantities of interest by exploiting known (and novel) properties of Markov models. Indeed, computation of the fundamental matrix of MMWF (and of derivative quantities) is nearly identical to the single population case, making extensions of a variety of computations to the model natural and straightforward.

4.2.1 Background

Both stochastic and deterministic changes in population size have been modelled using a variety of approaches, including branching process formulations [53], forward diffusion approaches [54, 55, 56], simple Markov models [57], and coalescent-based approaches [28]. For a comprehensive and fairly recent review of many of these approaches, see [56] and references therein. Many of these approaches differ in their desired endpoints. The effect of time-heterogeneity in population size on fixation probabilities, specifically, has been particularly well studied. Such computations are often performed for a Wright-Fisher population with an effective population size, $N_e$, that conserves some statistic between the approximating model and a real population of interest that is changing size. This statistic commonly describes some aspect of random genetic drift, but can be chosen to represent any desired property [5]. When a population changes size periodically, for example, under seasonal fluctuations, the harmonic mean (HM) of population size has been suggested as an appropriate $N_e$. When the proportion of time a population spends with size $N_i$ is $p_i$, the HM effective population size is given by:

$$N_e = \left( \sum_i \frac{p_i}{N_i} \right)^{-1} \tag{4.2.1}$$

The use of the HM approximation can be traced back at least to Wright [58], who suggested by verbal argument that it is appropriate for "cycle(s) of not too long a period". Karlin [57] was probably the first to reach this result explicitly, considering the effect of changes in population size using a simple Markov-model formulation with reversible and equiprobable transitions between population sizes. In that work, he showed that the harmonic mean effective population size produces the exact fixation probability expected from the time-variable population. Karlin [57, p. 450] emphasized that "the result is exact, provided the variation of population number from generation to generation undergoes uniform stochastic fluctuations". Gillespie [59] later showed that the harmonic mean effective population size can be arrived at approximately by a simple, general argument, which follows from a study of the rates of decrease in heterozygosity of neutral variations over time. However, Gillespie [59] did not discuss the conditions under which this approximation might break down. Rice [6] points out that one such condition is that the population size is not close to zero at any point in time. Using a branching-process model of changing population size, Otto and Whitlock [53] later considered such fluctuating population size, and showed that the HM approximation is expected to break down when the timescale of cyclic population size changes is long relative to the strength of selection (as foreshadowed by Wright [58]). Here, I confirm these results in a Markov model formulation of the expected time to fixation, and also show that the variances of expected times are not well described by the harmonic mean. I will also show that the harmonic mean can break down in several other ways when population size changes are episodic or non-reversible. All models are implemented in our new software package, WFES2 (Chapter 6).

4.2.2 Consideration of selection

A thorough treatment of the branching process model with variable population size has been provided by Otto and Whitlock [53]. The authors derive approximate expressions for the probability of fixation under one-time change in population size, logistic / exponential growth / decay, and periodically fluctuating population sizes. The branching process framework yields simple closed-form approximations. However, as pointed out by Otto and Whitlock [53], it applies only under positive selection. To deal with this limitation, the authors made an argument based on the symmetry of fixation times with respect to selection [60] in order to extend its application to deleterious variants. However, here I will show that this symmetry breaks down under variable population size (also see Chapter 3, where we demonstrate other violations of the symmetry of fixation times with respect to selection). Waxman [47] considered properties of the diffusion approximation under time-variable parameters. The author derived integral expressions to solve for the probability of fixation under any form of population-size function. Specifically, his work focused on the relative contributions of fluctuations in selection and population size, arguing that population size changes have a larger effect on probability of fixation. Only cases of increase and decrease in population size were considered, but the results apply to any value of selection, unlike the branching process treatment.

4.2.3 From fixation probabilities to allele frequency spectra

One of the direct applications of the WF model is the calculation of expected allele frequency probability distributions and how they change over time. Such computations are of direct interest for quantifying the behaviour of models, but also form the basis of several widely-used inferential methods for estimating population demographic histories and distributions of fitness effects (DFEs) [61, 62]. In this work, we use MMWF to calculate the expected AFS of an entire population following a given episodic demographic history. A related method was employed by [16] in order to model allele frequency distributions, but only with small N. The work of Song and Steinrucken [46, 45] also has the goal of finding the AFS following a variable population size demography, and we directly compare our results to theirs. As we pointed out in [1], an advantage of our approach is that it is completely general with respect to the assumptions of the underlying model, and thus allows different parameter regimes to be explored with ease. Unlike in SpectralTDF, we can easily explore the effect of extreme or previously inaccessible parameter regimes. By comparison, SpectralTDF seems to often crash and can experience large numerical errors if population size jumps, mutation rates, or selection coefficients are 'too large' ([45]; personal communication).

The rest of this chapter is organized as follows. I first describe the MMWF construction, and outline how some core properties of the model are calculated. I then show that MMWF and the harmonic mean approximations agree when modeling fluctuating population size, but differ in cases of non-reversible change in population size (e.g., following a specified demographic history). I then demonstrate the application to the AFS calculation, the differences compared to the diffusion methods, and discuss the computational and modeling advantages of MMWF in this context.

4.3 Methods

4.3.1 Time-homogeneous Wright-Fisher model

In a population of N reproducing individuals, the classical Wright-Fisher model describes the behavior of a biallelic locus with reference (wildtype) state a and derived state A, the focal allele. Given i copies of A and 2N − i copies of a, the transition probability function for the expected count of allele A in the next generation is:

$$P_{i,j} = \binom{2N}{j} \psi_i^{\,j} (1 - \psi_i)^{2N - j} \tag{4.3.1}$$

where $\psi_i$ can vary according to the assumptions of the model. We define ψ throughout using equation (2.3.14), which includes bidirectional mutation, selection, and dominance. States i = 0 and i = 2N are usually defined to be absorbing ($\bar{\mathcal{A}} = \{0, 2N\}$), and correspond to extinction and fixation of allele A respectively. The set of states $\mathcal{A}$ includes all non-absorbing (transient) allele count states. Equation (4.3.1) defines an absorbing discrete-time finite Markov chain. For any such Markov process, there exists a fundamental matrix $\mathbf{N}$ (note that the non-bold symbol N is used for the unrelated value of population size). The fundamental matrix has a very useful probabilistic interpretation as the expected number of generations prior to absorption (extinction or fixation), t, spent in state j, conditional on starting in state i:

$$\mathbf{N} = \{N_{i,j} = E(t_j \mid i);\ i, j \in \mathcal{A}\} \tag{4.3.2}$$

$$\mathbf{N} = \sum_{k=0}^{\infty} Q^k = (I - Q)^{-1} \tag{4.3.3}$$
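Equations (4.3.1) to (4.3.3) can be illustrated on a small population. In this sketch `neutral_psi` is an assumption standing in for the full parameterization of equation (2.3.14), which is not reproduced here; the function names and the choice 2N = 20 are hypothetical.

```python
import math
import numpy as np

def wf_transition_matrix(two_n, psi):
    """P[i, j] = C(2N, j) psi_i^j (1 - psi_i)^(2N - j), as in (4.3.1)."""
    P = np.zeros((two_n + 1, two_n + 1))
    for i in range(two_n + 1):
        q = psi(i, two_n)
        for j in range(two_n + 1):
            P[i, j] = math.comb(two_n, j) * q ** j * (1.0 - q) ** (two_n - j)
    return P

def neutral_psi(i, two_n):
    """Assumption: no selection or mutation, so psi_i is the current allele frequency."""
    return i / two_n

two_n = 20
P = wf_transition_matrix(two_n, neutral_psi)
Q = P[1:two_n, 1:two_n]                          # transient states: counts 1..2N-1
N_closed = np.linalg.inv(np.eye(two_n - 1) - Q)  # (I - Q)^{-1}

# Truncated series sum_k Q^k as an independent check of (4.3.3).
N_series = np.zeros_like(Q)
term = np.eye(two_n - 1)
for _ in range(3000):
    N_series += term
    term = term @ Q
```

Forming the full inverse is only sensible at this scale; for realistic population sizes only selected rows or columns would be solved for.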

Above, Q contains the transition probabilities $Q_{i,j}$ between transient states ($\mathcal{A}$), and I is an identity matrix of appropriate dimension (2 × 2 with two absorbing states, fixation and extinction). The transition probability matrix Q has the dimensions $|\mathcal{A}| \times |\mathcal{A}| = (2N - 1) \times (2N - 1)$ in the case of two absorbing states. We provide the technical details in the Supplement 4.6. For related proofs, the reader is referred to [8, section 3].

The fundamental matrix (4.3.2) allows direct calculation of quantities describing many important behaviors of the population during its transient phase and eventual termination, including the expected times and probabilities of extinction and fixation. As we will discuss below, the fundamental matrix can also be used to compute the variance and higher moments of the times to absorption, and other important statistics. While computing the entire matrix $\mathbf{N}$ is computationally expensive for even modest population sizes, it is possible to reduce the solutions for many properties to low-dimensional linear systems [1]. For example, the mean of the time to absorption (unconditional on which absorbing state is reached) given a starting state i is:

$$E(T_i) = \sum_{j \in \mathcal{A}} N_{i,j} = \sum_{j=1}^{2N-1} N_{i,j} \tag{4.3.4}$$

where the summation is over the ith row of $\mathbf{N}$. We are typically interested in cases where the population starts monomorphic for the wildtype allele, or with a small number of copies (the initial number of copies being determined by the forward mutation rate $\mu_{a \to A}$ [2]). In these cases, we require only a few rows $\mathbf{N}_{i,\cdot}$, corresponding to the starting states [2] with non-negligible probabilities under the mutation model, which can be solved from the linear system:

$$(I - Q)^T\, \mathbf{N}_{i,\cdot} = I_{i,\cdot} \tag{4.3.5}$$

Likewise, we can find the variance of the absorption time from the fundamental matrix:

$$\mathrm{Var}(T_i) = (2\mathbf{N}_{i,\cdot} - I_{i,\cdot})\, E(T_i) - E(T_i)^2 \tag{4.3.6}$$

Again, this can be solved in terms of linear systems (see Supplement 4.6). The properties described above apply irrespective of the parameterization of the WF model, as long as we can construct the transition matrix Q. Therefore, these results are general and work equally well for any transition matrix (i.e. with any mutation, selection, migration or dominance parameters). Importantly, they apply to the Markov-modulated model described below.
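The mean and variance in (4.3.4) to (4.3.6) can be obtained from two linear solves, without forming the fundamental matrix. The sketch below is illustrative: it assumes a neutral, mutation-free chain with 2N = 20, and it reads the first E(T) factor in (4.3.6) as the vector of mean absorption times over all starting states (the standard form for absorbing chains), which is an assumption about the intended notation. Function names are hypothetical.

```python
import math
import numpy as np

def wf_transient_block(two_n):
    """Transient block Q (allele counts 1..two_n-1) of a neutral Wright-Fisher chain.
    Assumption: psi_i = i / two_n, i.e. no selection or mutation."""
    P = np.array([[math.comb(two_n, j) * (i / two_n) ** j * (1 - i / two_n) ** (two_n - j)
                   for j in range(two_n + 1)]
                  for i in range(two_n + 1)])
    return P[1:two_n, 1:two_n]

def absorption_time_moments(Q, i):
    """Mean and variance of the absorption time from transient index i, without forming N."""
    n = Q.shape[0]
    A = np.eye(n) - Q
    e_i = np.zeros(n)
    e_i[i] = 1.0
    N_row = np.linalg.solve(A.T, e_i)              # row i of N, from (I - Q)^T x = e_i
    mean_i = N_row.sum()                           # E(T_i), as in (4.3.4)
    t = np.linalg.solve(A, np.ones(n))             # mean absorption time from every start
    var_i = (2.0 * N_row - e_i) @ t - mean_i ** 2  # (4.3.6), with t read as a vector
    return mean_i, var_i

Q = wf_transient_block(20)
mean_1, var_1 = absorption_time_moments(Q, i=0)    # start from a single copy
```

Both solves share the same coefficient matrix up to transposition, so in a larger implementation a single factorization could serve both.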

Figure 4.1: Markov-Modulated Wright-Fisher model. Each node $W_x$ represents a submodel, which is a full Wright-Fisher model. Switching between the component submodels is modulated by jump matrix α with rate parameters $\alpha_{x,y}$ and starting probabilities $p_0(x)$. Note that self-transitions are not shown. [Diagram not reproduced: nodes W1, W2, ..., WK joined by directed edges labelled α1,2, α2,1, α2,K, αK,2, α1,K, αK,1.]

4.3.2 Time-heterogeneous Wright-Fisher model

To express time-dependent parameters, we nest K Wright-Fisher models into a parent Markov process (Fig 4.1). Given WF models $W_1, W_2, \ldots, W_K$, each with a respective set of parameters, we then have a K-state parent Markov process, which governs switching between submodels in time, coupled to K WF models, which govern the dynamics of allele count evolution. This formulation produces a single Markov chain model with a compound state space across models and allele counts. We call this a Markov-modulated Wright-Fisher model (MMWF) when absorbing states are organized appropriately. The number of absorbing states is K × 2, when extinction and fixation are allowed in each of the submodels. The absorbing states can also be grouped into 2 effective absorbing state classes corresponding to extinction and fixation in any submodel (discussed in the Supplement). The general structure of this model is similar to how non-absorbing continuous-time Markov-modulated models of codon substitution are specified, which is a class of models that includes those of the so-called covarion type [63].

Transition probabilities for sub-matrices along the block diagonal refer to probabilities of allele count change when the parent process stays in the same state. In this case, the corresponding transition probability sub-matrices remain the same as in (4.3.1) (up to a normalizing constant). For sub-matrices off the block diagonal, we need to specify the transition probabilities from every state of the submodel $W_x$ to each other submodel $W_y$. We do this with a slightly modified Wright-Fisher transition function, similar to (4.3.1):

$$P_{i,j;x,y} = \binom{2N_y}{j} (\psi_{i;x,y})^{j} (1 - \psi_{i;x,y})^{2N_y - j} \tag{4.3.7}$$

Note that in comparison to (4.3.1), we have added dependence on the submodels x and y, with their respective population sizes $N_x$ and $N_y$. The equation for $\psi_{i;x,y}$ is given in (2.3.15); note that i is the current allele count in $N_x$. The full transition probability matrix of the MMWF process then has the following form:

$$P = \begin{pmatrix}
\alpha_{1,1} P^{(W_1,W_1)} & \alpha_{1,2} P^{(W_1,W_2)} & \cdots & \alpha_{1,K} P^{(W_1,W_K)} \\
\alpha_{2,1} P^{(W_2,W_1)} & \alpha_{2,2} P^{(W_2,W_2)} & \cdots & \alpha_{2,K} P^{(W_2,W_K)} \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_{K,1} P^{(W_K,W_1)} & \alpha_{K,2} P^{(W_K,W_2)} & \cdots & \alpha_{K,K} P^{(W_K,W_K)}
\end{pmatrix} \tag{4.3.8}$$
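The block structure of (4.3.8) can be assembled as follows. This is a minimal sketch under stated assumptions: the blocks use a neutral, mutation-free ψ in place of equation (2.3.15), the switching matrix `alpha` holds hypothetical values, and the function names are illustrative only.

```python
import math
import numpy as np

def wf_block(two_n_x, two_n_y):
    """P^(Wx,Wy): binomial sampling of 2N_y copies given i copies out of 2N_x.
    Assumption: neutral, mutation-free psi_{i;x,y} = i / (2 N_x)."""
    P = np.zeros((two_n_x + 1, two_n_y + 1))
    for i in range(two_n_x + 1):
        q = i / two_n_x
        for j in range(two_n_y + 1):
            P[i, j] = math.comb(two_n_y, j) * q ** j * (1.0 - q) ** (two_n_y - j)
    return P

def mmwf_matrix(two_n, alpha):
    """Assemble the block matrix (4.3.8): block (x, y) is alpha[x, y] * P^(Wx,Wy)."""
    K = len(two_n)
    return np.block([[alpha[x, y] * wf_block(two_n[x], two_n[y]) for y in range(K)]
                     for x in range(K)])

alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])        # hypothetical switching probabilities
P_full = mmwf_matrix([10, 20], alpha)  # 2*N1 = 10, 2*N2 = 20
```

Because each row of `alpha` and each row of every block sums to one, every row of the compound matrix also sums to one, which is a convenient sanity check after assembly.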

Properties such as mean time to absorption (4.3.4) and variance of time to absorption (4.3.6) can be calculated in a similar manner as in a single WF model, though they sometimes require integration over the starting and terminating models (see Supplement for full details). For convenience, we define the initial probability vector P(0) across the entire state space of the MMWF model. It will sometimes be convenient to consider this as the product of initial allele count probabilities $f^{(x)}(0)_i$ for allele count, i, in each model, x, and a set of initial submodel probabilities, π(0). Therefore we define P(0) for allele count i in submodel x as:

$$P(0)_{(i,x)} = f^{(x)}(0)_i\, \pi(0)_x \tag{4.3.9}$$

The initial allele counts in each model can then be set according to the probability of generating n new mutants per generation, given that we start with zero mutants (i.e., $P_{0,n}$, using the transition probabilities appropriate for each submodel). We also sometimes initialize $f^{(x)}(0)$ to correspond to a single new mutant in a specified starting population.

4.3.3 Allele frequency spectrum calculation

We develop two approaches for computing expected allele frequency spectra under piece-wise constant models. The first uses iterative sparse matrix-vector multiplications using the transition matrix of a non-absorbing single population model, and a special jump matrix to handle transitions at the boundaries of epochs (similar to [16]). In this approach, epochs are defined deterministically, and the compute time scales with the length of the demographic model considered. The second uses the MMWF formulation in a novel way, and makes the expected AFS computation insensitive to the length of the demographic model. Epochs in this approach are defined stochastically, so that epoch durations are geometrically distributed around a desired mean duration. Like SpectralTDF, this approach's time complexity scales with the number of epochs considered, and with the overall state-space of the model.

Deterministic jumps

Given an initial allele count distribution $f^{(1)}(0)$, the probability distribution of allele counts at time t can be straightforwardly calculated as $f(0) \times P^t$. The transition probability matrix P is a standard WF transition matrix (4.3.1), but with all states defined as transient (non-absorbing). It follows that for a demographic history with population sizes $N_1, N_2, \ldots, N_K$, with durations $g_1, g_2, \ldots, g_K$ generations, we can calculate the AFS at time $t = \sum_i g_i$ as:

$$f(t) = f^{(1)}(0) \times P_{W_1}^{g_1} \times P_{W_1,W_2} \times P_{W_2}^{g_2} \times \cdots \times P_{W_{K-1},W_K} \times P_{W_K}^{g_K} \tag{4.3.10}$$

where the transition probability matrices $P_{W_x,W_y}$ are given by (4.3.7). Computing this equation directly would be very time consuming, since it involves taking large matrices to typically large powers. Instead, we can iteratively update the current probability distribution of allele frequencies at each time step, by multiplying the distribution from the previous generation by the transition matrix. Since the transition matrix is sparse, this approach requires only sparse matrix-vector products, which can be computed very efficiently. This approach is algebraically exact, as we do not introduce further approximations or additional assumptions other than those in the model itself.
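The iterative update can be sketched for a two-epoch history. This is an illustration under stated assumptions, not the WFES2 implementation: the blocks assume neutral alleles with a symmetric mutation rate `u` (so that no state is absorbing), dense NumPy arrays stand in for the sparse matrices used in practice, and the sizes and durations are hypothetical.

```python
import math
import numpy as np

def wf_block(two_n_x, two_n_y, u=1e-3):
    """Non-absorbing transition block from size 2N_x to size 2N_y.
    Assumption: neutral alleles with symmetric mutation rate u."""
    P = np.zeros((two_n_x + 1, two_n_y + 1))
    for i in range(two_n_x + 1):
        x = i / two_n_x
        q = x * (1.0 - u) + (1.0 - x) * u
        for j in range(two_n_y + 1):
            P[i, j] = math.comb(two_n_y, j) * q ** j * (1.0 - q) ** (two_n_y - j)
    return P

def afs_deterministic(f0, two_n, durations):
    """Propagate f0 through the epochs of (4.3.10) using only vector-matrix products."""
    f = np.asarray(f0, dtype=float)
    for k, (size, g) in enumerate(zip(two_n, durations)):
        if k > 0:
            f = f @ wf_block(two_n[k - 1], size)  # one-generation jump between epochs
        P = wf_block(size, size)
        for _ in range(g):
            f = f @ P                             # one generation within the epoch
    return f

f0 = np.zeros(11)
f0[1] = 1.0                                       # a single mutant copy in W1
afs = afs_deterministic(f0, two_n=[10, 20], durations=[50, 30])
```

The cost grows linearly with the total number of generations, which is why the run time of this variant scales with the length of the demographic history.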

Stochastic jumps

The demographic history can also be modeled with MMWF in a straightforward manner. The one-generation switching probabilities can be specified, for example, so that $W_i$ switches to $W_{i+1}$ with probability $1/g_i$; the epoch durations will then be geometrically distributed with a mean of $g_i$ generations. As needed, transitions between submodels can be defined to only allow sequential transitions, such that the model begins in $W_1$ and transitions to $W_2, \ldots, W_K$ (note this is not a requirement, and we also consider arbitrary stochastic switching model parameterizations). Unlike for the general MMWF model with 2K absorbing states, this model allows absorption only at the termination of the demographic history in $W_K$ (typically at present day). To output the AFS, we make each allele count state in the terminating submodel an absorbing state, so we can compute the probability of ending in each allele count using standard absorption probability computations (2.3.6). Since absorption here simply refers to reaching present day, the AFS is then the vector of absorption probabilities across all allele count states. To the best of our knowledge this is a novel idea.

As in our previous work (Chapter 2), we can rapidly solve for the entire column i of B using $(I - Q) B_i = R_i$ (equation 2.3.12, Chapter 2). Since we have the entire column, it is computationally trivial to integrate a variety of computations over alternate starting distributions. In contrast, in the deterministic model of the previous section, we need to repeat the entire calculation for each new $f_0$. This variant of the MMWF has a transition matrix of the form:

$$P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix} = \begin{pmatrix}
\left(1 - \frac{1}{g_1}\right) Q_{W_1,W_1} & \frac{1}{g_1} Q_{W_1,W_2} & 0 & \cdots & 0 & 0 \\
0 & \left(1 - \frac{1}{g_2}\right) Q_{W_2,W_2} & \frac{1}{g_2} Q_{W_2,W_3} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \left(1 - \frac{1}{g_K}\right) Q_{W_K,W_K} & \frac{1}{g_K} I \\
0 & 0 & 0 & \cdots & 0 & I
\end{pmatrix} \tag{4.3.11}$$

Note here that the notation regarding the absorbing state implies that the absorbing state is actually a matrix, indicating that each allele count in the terminating submodel is a separate absorbing state. Unlike in the previous section, each of the component submodels, except the terminal one, is non-absorbing, so we do not consider fixation or extinction in each epoch.
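The absorption-probability view of the AFS in (4.3.11) can be sketched for a short sequential history. The example is illustrative only: the blocks assume neutral alleles with a symmetric mutation rate `u`, the sizes and mean durations are hypothetical, and a dense solve stands in for the sparse column-wise solves described above.

```python
import math
import numpy as np

def wf_block(two_n_x, two_n_y, u=1e-3):
    """Non-absorbing transition block from size 2N_x to size 2N_y.
    Assumption: neutral alleles with symmetric mutation rate u."""
    P = np.zeros((two_n_x + 1, two_n_y + 1))
    for i in range(two_n_x + 1):
        x = i / two_n_x
        q = x * (1.0 - u) + (1.0 - x) * u
        for j in range(two_n_y + 1):
            P[i, j] = math.comb(two_n_y, j) * q ** j * (1.0 - q) ** (two_n_y - j)
    return P

def afs_stochastic(f0, two_n, mean_durations):
    """AFS as absorption probabilities of (4.3.11), for starting distribution f0 in W1."""
    K = len(two_n)
    sizes = [n + 1 for n in two_n]                 # allele-count states per submodel
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    Q = np.zeros((offsets[-1], offsets[-1]))
    for x in range(K):
        stay = 1.0 - 1.0 / mean_durations[x]
        rows = slice(offsets[x], offsets[x + 1])
        Q[rows, rows] = stay * wf_block(two_n[x], two_n[x])
        if x + 1 < K:
            cols = slice(offsets[x + 1], offsets[x + 2])
            Q[rows, cols] = (1.0 - stay) * wf_block(two_n[x], two_n[x + 1])
    R = np.zeros((offsets[-1], sizes[-1]))
    R[offsets[K - 1]:, :] = np.eye(sizes[-1]) / mean_durations[-1]
    B = np.linalg.solve(np.eye(offsets[-1]) - Q, R)  # absorption probabilities
    return np.asarray(f0) @ B[:sizes[0], :]

f0 = np.zeros(11)
f0[1] = 1.0                                        # a single mutant copy in W1
afs = afs_stochastic(f0, two_n=[10, 20], mean_durations=[50, 30])
```

The size of the linear system depends on the number of epochs and their population sizes but not on the epoch durations, which enter only through the switching probabilities.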

Coarse approximation to AFS in reduced population sizes

We can reduce the state-space, and thus improve the computational performance of our method, by introducing a tunable approximation in the AFS calculation. Instead of considering each allele count as a separate state in the model, we may instead track states that correspond to allele frequencies of interest, at some precision. For example, if we seek the AFS for the full population of size N = 10,000, we can instead perform the calculation in an approximating population of N = 1000, and get the expected AFS to the closest tenth of a percent, with an approximating factor of ten. This idea has been widely applied in the theoretical population genetics literature in a variety of different contexts (e.g., [64]).

In order to approximately preserve the effects of mutation and selection, we require that γ = 2Ns and θ = 4Nµ are preserved between the full model and the approximate small population size model. The same applies to the duration of each epoch, such that if we use a factor of ten approximation as above, the epoch duration becomes ten times shorter. Note that in the case of multiple epochs in a demographic history, we do not have to approximate every epoch, and only use the approximation for the epochs with larger population sizes. The accuracy of this approach is evaluated by comparing how close results are compared to the full model (with original population sizes).

To improve the behavior of this approximation around the tails of the distribution, we multiply the AFS acquired from the small population size model $N_{approx}$ by a jump matrix $P_{approx,full}$ (4.3.7). This in effect adds a single generation transition from $N_{approx}$ to $N_{full}$. This approach appears to work reasonably well with $N_{approx} > 1000$. The approximation applies to both deterministic and stochastic switching models. However, in this chapter we only apply it to the stochastic model, where it significantly reduces the computation time.
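The parameter rescaling described above amounts to simple arithmetic. The sketch below uses hypothetical parameter values and a hypothetical helper name; it only illustrates how s, µ, and the epoch duration change when N is reduced by a given factor so that 2Ns and 4Nµ stay fixed.

```python
import math

def rescale_epoch(n_full, s, mu, generations, factor):
    """Scale one epoch down by `factor`, keeping 2Ns and 4N*mu unchanged."""
    return {
        "N": n_full / factor,
        "s": s * factor,                       # 2 * N_approx * s' = 2 * N_full * s
        "mu": mu * factor,                     # 4 * N_approx * mu' = 4 * N_full * mu
        "generations": generations / factor,   # epoch becomes `factor` times shorter
    }

# Hypothetical full-size epoch, reduced by an approximating factor of ten.
full = {"N": 10_000, "s": 1e-4, "mu": 1e-8, "generations": 5_000}
approx = rescale_epoch(full["N"], full["s"], full["mu"], full["generations"], factor=10)

gamma_full = 2 * full["N"] * full["s"]
gamma_approx = 2 * approx["N"] * approx["s"]
theta_full = 4 * full["N"] * full["mu"]
theta_approx = 4 * approx["N"] * approx["mu"]
print(math.isclose(gamma_full, gamma_approx), math.isclose(theta_full, theta_approx))
```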

4.4 Results

4.4.1 Fluctuating population size

First, we examine a model that fluctuates periodically between two population sizes, following assumptions that are consistent with models considered by Karlin [57]. We set up a two-state switching matrix, with parameter p corresponding to the expected proportion of time spent in population with size $N_1$:

$$\alpha = \begin{pmatrix} p & 1 - p \\ p & 1 - p \end{pmatrix} \tag{4.4.1}$$

If this model were allowed to go on forever, and if the compound Markov chain were non-absorbing, the distribution across submodels would converge to (p, 1 − p). We use this as an appropriate starting condition for the switching process. For these experiments, the population is initialized with a single mutant. Since this switching process is reversible and is consistent with Karlin's assumptions [57], we expect MMWF will agree with the harmonic mean approximation in this case. The harmonic mean effective population size $N_h$ is given by:

$$N_h = \left( \frac{p}{N_1} + \frac{1 - p}{N_2} \right)^{-1} \tag{4.4.2}$$

We refer to the model with population size Nh as a harmonic mean model (HM). Figure 4.2A shows the expected time and standard deviation of the time to fixation, as a function of time spent in N1. The model fluctuates between N1 = 1000 and N2 = 2000, in the absence of selection. The left side of the graph corresponds to the majority of time being spent in N2 (TFix ≈ 4N2), while on the right side of the graph, where p → 1, most of the time is spent in N1 (TFix ≈ 4N1). The statistics for MMWF are shown in red, and the statistics for HM in blue; they overlap entirely, indicating complete agreement. Figure 4.2B shows the probabilities of fixation, which also match between MMWF and HM. The dashed red line shows the probability of an allele fixing in N1, and the dotted red line shows the probability of fixing in N2, calculated from the switching model. These two probabilities sum together to yield the total probability of fixation (4.6.1). When a small proportion of time is spent in N1 (Fig. 4.2B, left side), the fixation probability is driven by population N2 (PFix ≈ 1/(2N2)), but when the majority of time is spent in N1, most of the fixations occur in N1 (PFix ≈ 1/(2N1)).
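As a small numerical illustration of the harmonic mean model in this section, the following Python sketch evaluates equation (4.4.2) and checks that (p, 1 − p) is left unchanged by one step of the switching matrix (4.4.1). The function names and the choice p = 0.25 are assumptions made for the example; the neutral expectation TFix ≈ 4N is taken from the text.

```python
def harmonic_mean_size(p, n1, n2):
    """Harmonic mean effective population size, equation (4.4.2)."""
    return 1.0 / (p / n1 + (1.0 - p) / n2)


def step(dist, alpha):
    """One step of the switching chain: row vector times matrix."""
    return [sum(dist[i] * alpha[i][j] for i in range(len(dist)))
            for j in range(len(alpha[0]))]


p, n1, n2 = 0.25, 1000, 2000          # p is an arbitrary example value
alpha = [[p, 1 - p], [p, 1 - p]]      # switching matrix (4.4.1)

stationary = step([p, 1 - p], alpha)  # remains (p, 1 - p)
n_h = harmonic_mean_size(p, n1, n2)   # harmonic mean size for this p
approx_t_fix = 4 * n_h                # neutral expectation T_Fix ~ 4N
```

At the two extremes the function returns the single-population sizes, `harmonic_mean_size(0, n1, n2)` giving N2 and `harmonic_mean_size(1, n1, n2)` giving N1, which matches the boundary behaviour described for Figure 4.2A.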

[Figure 4.2. Panel A: expected time to fixation ± standard deviation versus proportion of time in N1. Panel B: probability of fixation versus proportion of time in N1. Legend: MMWF, Harmonic, Fixation in N1, Fixation in N2.]

Figure 4.2: Fluctuating population size in a reversible switching model - MMWF (red) and HM (blue) models agree completely. (A) Expected time (line) and standard deviation (shaded) of fixation time versus expected time spent in N1. Increasing values of t represent larger amounts of time spent in N1. Theoretical expectation for single population size model is TFix ≈ 4N. (B) Probability of fixation versus expected time spent in N1. Dashed red line shows the probability of fixing in N1. Dotted red line shows the probability of fixing in N2. Population sizes: N1 = 1000, N2 = 2000; neutral variant (s = 0); initialized with a single mutant.

4.4.2 Increasing population size

Next, we examine the case of expansion in population size from N1 = 1000 to N2 = 2000 after an average of g generations. This case represents a simple non-reversible demographic history, and is similar to the expansion scenario considered by Otto and Whitlock [53]. A population starts with a single mutant in population N1, and jumps to N2 after an average of g generations. The corresponding switching matrix is:

$$\alpha = \begin{pmatrix} 1 - \frac{1}{g} & \frac{1}{g} \\ 0 & 1 \end{pmatrix} \tag{4.4.3}$$

This matrix imposes a geometric distribution on the time of the jump. After the jump, the population stays at size N2. Unlike in the previous case of fluctuating population size, this switching matrix is absorbing and thus non-reversible. Unlike in equations (4.4.1) and (4.4.2), there is no appropriate way to get the proportion of time spent in each model from the jump process alone - the switching probability matrix has a degenerate equilibrium distribution (0, 1). However, we can compute the expected amount of time spent in each model prior to absorption from the fundamental matrix (4.3.2) and use this to compute the appropriate HM. Depending on whether we are calculating fixation or extinction statistics, we would use the time spent in each submodel conditional on extinction or fixation respectively. Based on which time measure we use, we therefore get fixation or extinction effective population sizes [53]. The fixation effective population size is:

$$N_{\mathrm{Fix}} = \left( \frac{\bar{\pi}(N_1 \mid \mathrm{Fix})}{N_1} + \frac{\bar{\pi}(N_2 \mid \mathrm{Fix})}{N_2} \right)^{-1} \tag{4.4.4}$$

where π̄(N1|Fix) is the mean proportion of time spent in N1 prior to fixation (4.6.3). Figure 4.3 shows the comparison between MMWF and HM, calculated using the fixation effective population size (4.4.4). In Fig. 4.3A, the expected time to fixation matches between MMWF and HM, but the standard deviation is larger with MMWF in some cases. As before, the average time to fixation on the boundaries of the parameter range corresponds to theoretical expectations with fixed population size (left ≈ 4N2, right ≈ 4N1). Figure 4.3B shows the probabilities of fixation as before. The first notable feature of these results is that for the MMWF model, the overall fixation probability (in either population) is constant with respect to the expected time spent in N1, and appears determined by the starting population size (i.e., PFix ≈ 1/(2N1)). This is a consequence of equation (4.3.7), which assumes that at the time of the switch the new generation is randomly sampled from the old population, and should thus have similar allele frequencies, subject to the effects of modelled forces over the course of this one generation. Figure 4.8 in the supplement shows an example simulation of a population size doubling from N1 = 1000 to N2 = 2000. At the time of the switch, the count of the mutant allele increases approximately in direct proportion to the magnitude of the population size increase (subject to (4.3.7)).

This is relevant because for a neutral variant, PFix is approximately x/2N, for x initial mutants. When starting in N1 and assuming x = 1, the probability of fixation is expected to start at 1/(2N1). As time goes on in N1, assume the mutant allele frequency remains near its expected value of 1/(2N1) (with no selection, mutation, or migration) until the population size changes. Then, at the moment after the population size switches, the expected allele count will be ≈ (1/(2N1)) × (2N2). At this moment, the probability of fixing in N2 will thus be ((2N2)/(2N1))/(2N2) = 1/(2N1), which is the same as the probability of fixation in the starting population.

This result can thus be seen as a consequence of deterministically starting in population N1, together with the assumption that the new population is randomly sampled from the old. The second notable feature is that the distribution of expected fixation times is broader under MMWF than HM for the range of switching parameters where the probabilities of fixation in both populations are non-negligible (i.e., when there is non-negligible uncertainty about which population a fixation might occur in; Fig. 4.3A, B). The point where the standard deviation of MMWF maximally exceeds the standard deviation of HM (Fig. 4.3) corresponds to the point where the breadth of the fixation time distribution is maximal in MMWF. Interestingly, this occurs when it is equally likely that a fixation will happen in N1 or N2 (Fig. 4.9). That is, the HM model performs most poorly at approximating the fixation time distribution's higher moments when the uncertainty in which population a fixation may occur in is maximal.

[Figure 4.3. Panel A: expected time to fixation ± standard deviation versus expected number of generations (t) in N1 (log scale, 10^0 to 10^6). Panel B: probability of fixation versus expected number of generations (t) in N1. Legend: MMWF, Harmonic, Fixation in N1, Fixation in N2.]

Figure 4.3: Instantaneous doubling of population size from N1 = 1000 to N2 = 2000, after an average of t generations (x-axis, log-scale). MMWF is in red, HM is in blue. (A) Expected time to fixation shown as solid line and standard deviation as a shaded region. Boundaries agree with theoretical expectations. The expected time to fixation agrees between MMWF and HM, but MMWF has a higher variance at intermediate values of t. (B) Probability of fixation in MMWF (red) and HM (blue). Probability of fixation is determined by the starting population size N1 in MMWF, but is time-dependent in the harmonic mean model. Probability of fixation in N1 and in N2 are displayed as red dashed and dotted lines respectively. The probabilities of fixation in each submodel vary over time, but add up to a constant value.

We now consider the same scenario of doubling the population size, but with selection. The mean time of population size increase is chosen to be t = 5,000, where the standard deviations in the neutral case are close to maximally dissimilar between MMWF and HM (Fig. 4.4).

[Figure 4.4. Panel A: expected time to fixation ± standard deviation versus selection. Panel B: probability of fixation versus selection. Legend: MMWF, Harmonic, Fixation in N1, Fixation in N2.]

Figure 4.4: Instantaneous doubling of population size from N1 = 1000 to N2 = 2000, after an average of 5,000 generations in N1, with variable selection (x-axis). MMWF in red, harmonic mean in blue. Selection is the population-scaled selection coefficient. (A) Expected time to fixation shown as solid line and standard deviation as a shaded region. Neither the expected time to fixation nor its standard deviation agrees, particularly for negative selection. (B) Probability of fixation in MMWF (red) is higher than for HM (blue) for positively selected and nearly neutral variants.

In this parameter range, neither times nor probabilities agree closely between MMWF and HM. The probability of fixation given by MMWF notably exceeds that of HM (Fig. 4.10) for nearly neutral and positively selected variants, but is slightly smaller than HM for negatively selected variants. We also note that neither HM nor MMWF is symmetric with respect to selection in this scenario (panel A). This is a counterexample to the symmetry of fixation times with respect to selection [25, 53] when population size is changing. The effect appears similar to the stochastic slowdown phenomenon described previously [32, 34, 2] (Chapter 3).
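The argument made earlier in this section, that resampling the new generation from the old population leaves the neutral fixation probability unchanged at the moment of the switch, can be checked numerically. The sketch below is not the jump matrix of (4.3.7) itself; it assumes the jump is a plain binomial resampling step with no selection or mutation, uses deliberately tiny population sizes, and the helper name `jump_row` is hypothetical.

```python
from math import comb


def jump_row(count, n_from, n_to):
    """Distribution of the allele count in the first generation at size n_to,
    assuming binomial sampling of 2*n_to gene copies from a population of
    size n_from in which the allele has frequency count / (2*n_from)."""
    freq = count / (2 * n_from)
    size = 2 * n_to
    return [comb(size, j) * freq**j * (1 - freq)**(size - j)
            for j in range(size + 1)]


n1, n2 = 10, 20       # toy sizes; the text uses N1 = 1000, N2 = 2000
count_before = 7      # arbitrary allele count just before the switch

row = jump_row(count_before, n1, n2)
expected_count_after = sum(j * pr for j, pr in enumerate(row))

# Neutral fixation probability x / 2N before the switch, and its expectation
# over the resampled count after the switch.
p_fix_before = count_before / (2 * n1)
p_fix_after = expected_count_after / (2 * n2)
```

Under these assumptions the expected count scales by N2/N1 while the neutral fixation probability x/2N is unchanged in expectation, which is the behaviour seen for the total fixation probability in Figure 4.3B.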

4.4.3 Distribution of allele frequencies

We now turn our attention to the expected distribution of allele frequencies under time-variable population size. We compare the calculations from the deterministic and stochastic models with SpectralTDF, a diffusion theory method described in [45]. In the examples below, we consider an arbitrary demographic history over three epochs with N = (1000, 500, 10000) of (expected) duration g = (400, 200, 100). In all cases, we use the equilibrium distribution of allele frequencies in the corresponding non-absorbing model of the initial population as the initial condition, f^(0)(0). We examine the AFS for neutral (2N_anc s = 0) and deleterious (2N_anc s = −10) variants in the following experiments.

Figure 4.5 shows the AFS at present day, calculated by the three methods. We see a general agreement in the shape of the distribution: the majority of the mass is around the tails, producing the characteristic U-shape [58, 5]. However, there is slight disagreement between the methods, even though the fundamental parameters of each model are set to be equivalent. Some disagreement is expected owing to the slightly different assumptions and/or approximations used by each method. The stochastic model calculation was performed with a scaling factor of 10, which did not appreciably affect the calculation (Fig. 4.11). Note that SpectralTDF produces an unscaled distribution, and does not solve for the distribution tails at the absorbing boundaries (0 or 2N copies respectively), since absorbing boundaries are problematic to deal with in diffusion approaches (e.g., Chapter 3). In the figures below, we empirically translated the SpectralTDF distributions so that there is maximal agreement with other methods at the midpoint of the AF range.

Figure 4.6 shows the AFS with 2N_anc s = −10. While the deterministic and the stochastic model agree relatively closely, the SpectralTDF results deviate more significantly. This deviation can be traced to differences in the effective transition probability matrices used by the different approaches. SpectralTDF employs an eigendecomposition of the transition density function, which is stitched together across epochs, while the MMWF-based approaches use full transition probability matrices as described above.

[Figure 4.5. Panels A and B: probability (log scale, 10^0 to 10^-8) versus allele frequency (A: 0.0 to 1.0; B: 0.00 to 0.05). Legend: Deterministic, Stochastic, SpectralTDF.]

Figure 4.5: Full allele frequency spectra after a non-equilibrium demography for neutral variants, starting at mutation-drift equilibrium in the ancestral population. Probabilities (y-axis) are shown on a log scale. (A) Full AFS. (B) Lower 5% of the AFS from panel A. Population sizes N = (1000, 500, 10000), epoch lengths t = (400, 200, 100), per-generation mutation rate µ = 1 × 10^-9.

[Figure 4.6. Panels A and B: probability (log scale, 10^2 to 10^-12) versus allele frequency (A: 0.0 to 1.0; B: 0.00 to 0.05). Legend: Deterministic, Stochastic, SpectralTDF.]

Figure 4.6: Full allele frequency spectra after a non-equilibrium demography for deleterious variants, starting at mutation-selection equilibrium in the ancestral population. Probabilities (y-axis) are shown on a log scale. (A) Full AFS. (B) Lower 5% of the AFS from panel A. Population sizes N = (1000, 500, 10000), epoch lengths t = (400, 200, 100), selection coefficient 2N_anc s = −10, per-generation mutation rate µ = 1 × 10^-9.

To help illuminate the differences among the approaches, we investigated the changes in allele frequency distributions over time between SpectralTDF and MMWF, starting with a single mutant (Fig. 4.7). The change per generation in the MMWF model is larger than in SpectralTDF (Fig. 4.6). This may be expected of a diffusion approximation, since the change per generation is assumed small both when taking the diffusion limit [65], and when discarding higher moments. No such assumptions are made, however, in the Markov model. It is unclear whether the magnitude of the observed variation can be explained by this, however, and some of the difference could also be due to other approximations unique to the way the spectral decomposition of the transition density function is performed in SpectralTDF. Adjusting the several tunable parameters in SpectralTDF according to recommendations by the authors did not appreciably change the results (not shown). We expect that the Markov model implementation is more likely to be correct, since it performs an exact forward-time calculation. We forgo simulation validation here because it would be prohibitively expensive: sufficient samples would have to be generated for low-probability events (note probabilities below 10^-8 in Fig. 4.5). A deeper investigation is required to establish the underlying reason for the differences.

A further point of difference between the approaches is the time required to perform the computation. Table 4.1 compares computation times for the AFS while varying the population size in the last epoch. Every method generally needed more time as the population size increased. It appears that the approximate version of the stochastic model scales best with increasing population size; it was the fastest method at every terminal population size we considered except the smallest (N = 10,000), where the exact (deterministic) calculation was marginally faster. The deterministic approach to the AFS calculation is relatively fast due to the efficiency of sparse matrix-vector multiplication. However, it scales relatively poorly with long epoch times of greater than 50,000 generations (Table 4.2). In comparison, the stochastic calculation and SpectralTDF were both insensitive to epoch length. The stochastic approach was thus always faster than SpectralTDF under the conditions we considered, often by a large margin when run on the same machine.
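The sensitivity of the deterministic calculation to epoch length can be understood from its structure: the allele count distribution is advanced by one vector-matrix product per generation, so the work grows with the number of generations. The following Python sketch illustrates this with a dense, neutral Wright-Fisher matrix for a very small population. It is a simplification under stated assumptions (no selection, no mutation, dense lists rather than the sparse matrices mentioned above), and the names `wf_matrix` and `propagate` are hypothetical rather than part of WFES2.

```python
from math import comb


def wf_matrix(n):
    """Neutral Wright-Fisher transition matrix on allele counts 0..2n
    (no selection or mutation; a simplifying assumption for this sketch)."""
    size = 2 * n
    rows = []
    for i in range(size + 1):
        x = i / size  # allele frequency in the parental generation
        rows.append([comb(size, j) * x**j * (1 - x)**(size - j)
                     for j in range(size + 1)])
    return rows


def propagate(dist, matrix, generations):
    """Advance a distribution over allele counts, one product per generation."""
    for _ in range(generations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(len(dist)))
                for j in range(len(matrix[0]))]
    return dist


n = 10                          # toy population size
start = [0.0] * (2 * n + 1)
start[1] = 1.0                  # a single mutant copy
after = propagate(start, wf_matrix(n), generations=50)

mean_frequency = sum(j * pr for j, pr in enumerate(after)) / (2 * n)
```

In this neutral toy case the mean allele frequency stays at its initial value of 1/(2n) while probability mass accumulates at the two absorbing counts; doubling `generations` doubles the number of products performed.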

[Figure 4.7. Panels A and B: probability (log scale, 10^0 to 10^-3) versus number of copies (0 to 20), one curve per generation (Time 1 to 10).]

Figure 4.7: Changes in allele frequency distribution function starting at a single allele copy, over single generations. (A) MMWF discrete model, and (B) SpectralTDF. With the discrete approach in MMWF, larger per-generation changes in allele frequencies are allowed, while they are restricted by assumption in the diffusion models.

Terminal pop. size    Exact (s)    Stochastic (s)    SpectralTDF (s)
10000                   9.93          10.14              90.07
20000                  24.68          17.34             153.20
30000                  44.58          27.30             251.61
40000                  95.22          35.07             278.87
50000                  91.83          43.26             337.94
60000                 170.99          56.95             493.13
70000                 151.68          63.00             569.83
80000                 184.85          71.66             529.41
90000                 247.51          83.12             729.35
100000                265.76         105.92             642.23

Table 4.1: Computation time for the AFS using different methods, as a function of population size (last epoch). The stochastic model used an approximation factor of 10× for the terminal epoch. Measurements are in wall-clock seconds, taken on an 8-core Intel Xeon processor (using a single thread for each method). Note that for the larger population sizes, the stochastic method will benefit substantially by using multiple threads (which is available as an option in WFES2). Replicates were unnecessary since each method is a direct computation, not a simulation.

Epoch duration    Exact (s)    Stochastic (s)    SpectralTDF (s)
1000                34.85          10.13             160.33
2000                62.62          10.12             160.13
3000                90.47          10.12             159.23
4000               118.02          10.14             159.63
5000               145.93          10.11             159.22
6000               173.29          10.13             158.95
7000               201.24          10.14             159.01
8000               228.93          10.14             159.02
9000               256.40          10.15             158.87
10000              287.32          10.15             159.91

Table 4.2: Computation time for the AFS using different methods, as a function of the epoch length (last epoch, N = 10000). The measurements are in wall-clock seconds, taken on an 8-core Intel Xeon processor.

4.5 Conclusions

The Markov-modulated Wright-Fisher model (MMWF) is a versatile modeling approach for populations with time-heterogeneous parameters. In this chapter, we addressed several simple models of changing population size. In future work (Chapter 5), we will address applications with time-variable selection in the same modeling framework. The MMWF framework is a natural extension to the standard Wright-Fisher model, since it simply extends the state-space of the model to allele counts across a set of alternating model regimes (4.3.7). Due to this, the model allows an easy derivation of the amount of time spent (4.3.4) in each submodel. This, in turn, enables a full characterization of the behavior of statistics such as average time to fixation and its variance.

By exploring a variety of demographic scenarios in the exact and stochastic models, we characterized new limitations to the harmonic mean effective population size. We first examined the behavior of the harmonic mean effective population size under a fluctuating population model (Fig. 4.2). In this scenario, the harmonic mean has been proven [57] to be appropriate, which we confirmed empirically. However, under non-reversible population size change, such as under a demographic history with varying population size, the harmonic mean is only an approximation (and can sometimes be a relatively poor one). In particular, higher moments of various statistics like fixation time can be poorly approximated by HM, and it provides an especially poor approximation for non-neutral variants. We also expect that AFS computed under a harmonic mean approximation could miss significant features of AFS under a fully time-heterogeneous model (not shown).

For computation of expected allele frequency distributions in the MMWF framework, we found that the stochastic method that computes the AFS as absorbing probabilities at present day is fast, and scales well with large population sizes (especially when coupled with approximations to large population sizes in the demographic history). We show that the deterministic and stochastic methods produce consistent AFSs, which diverge from a diffusion approximation under an otherwise equivalent model [45]. We were unable to fully determine the reason for the disagreement between methods, but suspect that it results from the different assumptions and approximations made by the approaches.

4.6 Supplement

4.6.1 Supplementary methods

Summarizing statistics for time-heterogeneous model

As in the main text, we here let B be the $\sum_{x}^{K}(2N_x - 1) \times 2K$ matrix of conditional absorption probabilities, with columns corresponding to extinction or fixation in each submodel, and rows corresponding to the entire transient state-space of the MMWF across both models and allele counts. We also let N be the fundamental matrix of the entire MMWF, which has dimensions $\sum_{x}^{K}(2N_x - 1) \times \sum_{x}^{K}(2N_x - 1)$ and is unconditional on which absorbing state is reached. When we assume that we deterministically start in one model regime and are interested in properties of absorption in a single model regime, computations are the same as in the single population case if we restrict our attention to the appropriate row of the appropriate submodel of the relevant matrix. However, when we are interested in properties where the starting and terminal model regimes are uncertain, appropriate sums need to be constructed across rows (for the starting model) and potentially across columns (for the terminating model). The details of these summations depend upon the statistic being computed.

If absorption is allowed in each submodel, we can construct a new $\sum_{x}^{K}(2N_x - 1) \times 2$ matrix, $\hat{B}$, by summing over all the columns corresponding to the equivalent absorbing state in any of the submodels. That is, we set:

$$\hat{B} = [B_{(\cdot,\mathrm{ext})}, B_{(\cdot,\mathrm{fix})}] \tag{4.6.1}$$

$$B_{(\cdot,\mathrm{ext})} = \sum_{k \in \mathrm{extinction}}^{K} B_{(i,k)}, \qquad B_{(\cdot,\mathrm{fix})} = \sum_{k \in \mathrm{fixation}}^{K} B_{(i,k)}$$

We can thus use $\hat{B}$ to compute properties of the model when extinction or fixation occurs in any submodel.

Probability of absorption. Now we show how to compute a single expression for the probability of fixation by considering each possible starting and absorbing component model. Using $P(0)_i$ as the vector of initial probabilities across all states in the full state space of MMWF, and labelling the columns of $\hat{B}$ as k ∈ {Ext, Fix}, by the law of total probability we have:

$$P_{\mathrm{Fix}}(P(0)) = \sum_{i}^{\sum_{x}(2N_x - 1)} P(0)_i \, \hat{B}_{i,\mathrm{Fix}} \tag{4.6.2}$$

In practice, when we let initial allele counts be determined by the mutation model, we split the summation into two nested summations over model state m and initial allele count x, so that we can terminate the interior summation over allele count states when the corresponding initial probability falls below a small threshold [2]. The probability of extinction can be computed in the same way, or simply as 1 − PFix.
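Equations (4.6.1) and (4.6.2) amount to a column sum followed by a dot product, as the following Python sketch illustrates. The numbers in `B` and `p0` are invented toy values, not results from the model, and the column ordering (extinction and fixation alternating by submodel) is an assumption made only for this example.

```python
def collapse_absorption(B, ext_cols, fix_cols):
    """Sum the columns of B over submodels, giving the two-column matrix
    [extinction in any submodel, fixation in any submodel] of (4.6.1)."""
    return [[sum(row[k] for k in ext_cols), sum(row[k] for k in fix_cols)]
            for row in B]


def p_fix(p0, B_hat):
    """Total probability of fixation, equation (4.6.2)."""
    return sum(p * row[1] for p, row in zip(p0, B_hat))


# Toy example: 3 transient states and K = 2 submodels. Columns are assumed
# to be ordered (ext in 1, fix in 1, ext in 2, fix in 2).
B = [[0.6, 0.1, 0.2, 0.1],
     [0.3, 0.2, 0.2, 0.3],
     [0.1, 0.4, 0.1, 0.4]]
B_hat = collapse_absorption(B, ext_cols=[0, 2], fix_cols=[1, 3])

p0 = [0.5, 0.25, 0.25]          # hypothetical initial distribution
total_fix = p_fix(p0, B_hat)
total_ext = 1.0 - total_fix
```

Because each row of B sums to one, each row of the collapsed matrix does as well, so the extinction probability follows directly as the complement.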

Time to absorption. We similarly derive conditional sojourn times where we condition on the class of absorbing states we terminate in (e.g., either extinction in any submodel, or fixation in any submodel). Here, we integrate over all K starting submodels, and over all K terminating submodels, and compute a fundamental matrix for absorption in absorbing class k, $\hat{N}_k$, that we index using $\hat{N}_{k(i,j)}$ for initial allele count i and transient allele count state j. In other words, this expression gives the expected time spent in allele count state j on the way to absorbing class k, conditional on starting in allele count i (in any submodel). For simplicity of notation, we use a function to convert allele count, x, and model state m, to the corresponding state in the full model state space, i, F(x, m) = i. Similarly, we use the inverse function $F^{-1}(i) = (x, m)$, which returns the corresponding allele count state x and model state m for state i in the full state-space. We will also assume the availability of an initial probability distribution across model states m, π(m), described in the methods section of the main text. By the law of total expectation we have:

$$\hat{N}_{k(i,j)} = \sum_{m}^{K} \sum_{n}^{K} \pi(m) \, \frac{\hat{B}_{(F(j,n),k)} \, N_{(F(i,m),F(j,n))}}{\hat{B}_{(F(i,m),k)}} \tag{4.6.3}$$

As before, we can then use equations (4.3.4) and (4.3.6) to calculate the mean and variance of these statistics for the Markov-modulated model.
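A direct transcription of equation (4.6.3) into Python may help make the indexing concrete. This is a sketch under assumptions: the state layout in `state_index` (submodels stacked one after another, allele counts 1 to 2N − 1 within each) is one possible choice for F(x, m), allele counts that do not exist in a given submodel are simply skipped, and the example inputs describe a symmetric random walk on 0..4 with a single submodel, used as a stand-in for a Wright-Fisher model because its fundamental matrix is known in closed form.

```python
def state_index(x, m, sizes):
    """F(x, m): index of allele count x (1..2N_m - 1) in submodel m,
    assuming submodels are stacked in order (hypothetical layout)."""
    offset = sum(2 * size - 1 for size in sizes[:m])
    return offset + (x - 1)


def conditional_sojourn(i, j, k, N, B_hat, pi, sizes):
    """Expected time spent at allele count j on the way to absorbing class k,
    starting from allele count i, following equation (4.6.3)."""
    total = 0.0
    for m, size_m in enumerate(sizes):
        if not 1 <= i <= 2 * size_m - 1:
            continue  # count i is not a transient state of submodel m
        start = state_index(i, m, sizes)
        for n, size_n in enumerate(sizes):
            if not 1 <= j <= 2 * size_n - 1:
                continue  # count j is not a transient state of submodel n
            via = state_index(j, n, sizes)
            total += pi[m] * B_hat[via][k] * N[start][via] / B_hat[start][k]
    return total


# Toy check with K = 1: symmetric random walk on 0..4 (transient states 1..3).
sizes = [2]                      # 2N = 4, so counts 1..3 are transient
pi = [1.0]
N = [[1.5, 1.0, 0.5],            # fundamental matrix of the toy walk
     [1.0, 2.0, 1.0],
     [0.5, 1.0, 1.5]]
B_hat = [[0.75, 0.25],           # columns: (extinction, fixation)
         [0.50, 0.50],
         [0.25, 0.75]]

# Expected time to fixation from a single copy, conditional on fixation.
t_fix_given_fix = sum(conditional_sojourn(1, j, 1, N, B_hat, pi, sizes)
                      for j in (1, 2, 3))
```

With a single submodel the double sum collapses to the familiar single-population expression, which gives a simple way to test an implementation before applying it to several submodels.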

4.6.2 AFS approximation scaling

The approximation described in 4.3.3 has a moderate effect on the precision of the calculation (Fig. 4.11), in exchange for faster calculation speed. Table 4.3 shows the number of seconds each AFS calculation takes to complete. The approximating factor of 10 appears to provide an acceptable amount of error and a relatively quick computation time.

Approximating factor    Stochastic model runtime, s
100                       11.73
90                        13.62
80                        16.10
70                        16.37
60                        17.30
50                        20.52
40                        23.34
30                        33.28
20                        45.58
10                       105.24
5                        265.12
2                        847.88
1                       1667.54

Table 4.3: Computation time for the AFS using different approximating factors. The measurements are in wall-clock seconds, taken on an 8-core Intel processor.

4.6.3 Supplementary figures

[Figure 4.8: allele counts (y-axis, 0 to 4000) versus generations (x-axis, 0 to 2500). Legend (population size): N1 = 1000, N2 = 2000. Annotations: switch at t = 1107; c = 1416, p = 0.708 before the switch; c = 2859, p = 0.71475 after the switch.]

Figure 4.8: Example simulation of increasing population size from N1 = 1000 to N2 = 2000. A neutral allele starts in a single copy in population N1, and switches to N2 at t = 1107 generations. The red line indicates the trajectory in N1 and the blue line the trajectory in N2. At the time of the switch (vertical dashed line), the count of the allele approximately doubles from 1416 to 2859.

[Figure 4.9: coefficient of variation of the time to fixation (y-axis, 0.550 to 0.700) versus expected number of generations in N1 (x-axis, log scale, 10^0 to 10^6).]

Figure 4.9: Coefficient of variation, CV = sd(TFix)/E(TFix), for the time to fixation in Fig. 4.3A. The maximum CV occurs when PFix(N1) ≈ PFix(N2) (vertical line), calculated from data in Fig. 4.3B.

[Figure 4.10: ratio PFix(MMWF)/PFix(HM) (y-axis, 0.8 to 1.3) versus 2Ns (x-axis).]

Figure 4.10: Ratio of probabilities of fixation between MMWF and HM in Fig. 4.4. For positively selected and nearly neutral variants (2Ns > −2), MMWF predicts a higher probability of fixation.

[Figure 4.11: probability (y-axis, log scale, 10^-1 to 10^-9) versus allele frequency (x-axis, 0.000 to 0.005), one curve per approximating factor (1, 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100).]

Figure 4.11: Full allele frequency spectrum when using different levels of approximation. Probabilities (y-axis) are shown on a log scale. Population sizes N = (1000, 500, 100000), epoch lengths t = (400, 200, 100), selection coefficients 2N_anc s = 0, per-generation mutation rate µ = 1 × 10^-9.

Chapter 5

Molecular Substitution Rate with Standing Genetic Variation and Recurrent Mutation

Contributions

JdK and IK developed the method, IK implemented the method, IK and JdK analyzed the data and wrote the paper. This chapter is under preparation for submission to Genetics or MBE.

5.1 Abstract

Much standard molecular evolutionary theory describes adaptation in a mutation-limited regime, where individual mutant alleles arise and sweep to fixation one by one [12]. However, increasing evidence suggests that populations may frequently undergo rapid adaptation to changing environments through selection on standing genetic variation (SGV) or on recurrent mutations (RM) in large populations [40]. In this chapter, I model the rate of adaptive molecular substitution while accounting for SGV and RM. I develop an extended Wright-Fisher model based on the Markov-Modulated Wright-Fisher formalism introduced in the previous chapter, to directly represent variation that exists in a population before the onset of positive selection. This is accomplished by allowing the population to accumulate neutral variation for a variable number of expected generations before it irreversibly switches into an absorbing Wright-Fisher model of positive selection. Our results show and quantify the acceleration of the rate of adaptive substitution with standing genetic variation. Importantly, this approach relaxes the assumption of mutation-selection balance prior to the onset of positive selection, while maintaining computational efficiency (requiring computation of only a single row of the fundamental matrix, regardless of the average time spent accumulating neutral variation). We show that the model predicts that it might take an exceedingly long time for biological populations to reach the equilibrium distribution of allele frequencies described by mutation-selection balance (except when variants are highly deleterious), and highlight the need to revisit whether mutation-selection balance is really an effective model of real populations.

5.2 Introduction

The rate of adaptive substitution is often limited by the amount of genetic variation available for natural selection to act on. If beneficial mutations are rare, the rate of substitution will be bounded by the time spent waiting for new mutations. This mutation-limited view has been integral to many molecular evolutionary models and methods (e.g. [66, 12, 67]). On the other hand, it is possible that adaptive mutations are present in a population as standing genetic variation (SGV). For example, neutral mutations can become established in the population before the onset of positive selection. Mutations can also be introduced into the population multiple times via recurrent mutation or migration. When adaptation proceeds by selection on either standing genetic variation or recurrent mutation, it is possible for substitutions to occur via soft sweeps [68]. Recent work has shown that soft sweeps are likely ubiquitous in nature, with numerous examples in Drosophila [69, 70, 71] and humans [72, 73, 74, 75]. Adaptation from SGV is particularly pertinent in pathogens, where evolution of drug resistance is often associated with the presence of resistance variants before the onset of treatment. Multiple cases have been described in HIV drug resistance [76, 77, 78] (see Pennings et al. [79] for a review). P. falciparum is another well-studied example [37, 80].

The concept of a soft sweep was introduced by Hermisson and Pennings [68]. They defined a soft sweep as resulting in the fixation of an adaptive allele, where the 'fixed' population may possess adaptive alleles with multiple origins. A soft sweep thus generally leaves the population identical by state, but not identical by descent. In contrast, a hard sweep is said to occur when the population is both identical by state and identical by descent. Traditional molecular evolutionary models, which implicitly assume evolution is mutation-limited, all assume substitutions represent hard sweeps. Given, however, that the majority of known cases of selective sweeps are actually believed to be soft sweeps (see above), the focus of molecular evolutionary theory on mutation-limited scenarios may represent a critical mismatch between models of molecular evolution and the populations they are meant to represent.

Soft sweeps have been extensively described and studied in population genetics [81, 41, 82, 38]. Much attention has been devoted to the detection of soft sweeps from variation surrounding the selected locus. Soft sweeps do not have the strong variation-depleting signature of a hard sweep [83], and they have multiple haplotype backgrounds, making them harder to detect in selection scans. Ongoing effort continues to quantify the proportion of adaptation that may be due to soft sweeps [84, 74].
To model adaptation from standing genetic variation one needs to assume some initial conditions with respect to the distribution of allele frequencies. In standard phylogenetic substitution models (including mutation-selection models), it is always assumed that substitutions represent a completely wildtype population instantaneously becoming fixed for a new state [10]. Even when time-heterogeneous selection is considered, it is usually assumed that populations are monomorphic when selection changes regimes. We note this is fairly unrealistic in the presumably common case where functionally exchangeable alleles are being generated and allowed to segregate over time under negative selection (in one selection regime). When selection changes to a new regime, the starting conditions may matter quite a bit, since previously isofunctional alleles may have reached establishment frequencies before the onset of positive selection. In this instance, the time to the next substitution will be much shorter than when the newly adaptive allele starts in a single copy.

Previously developed methods in population genetics typically account for variation in starting conditions by assuming that allele frequencies are distributed according to mutation-selection equilibrium [81, 39, 68, 85]. Here, we show that under neutrality and positive selection, the time to convergence (TC) on the equilibrium distribution can be exceedingly long. We develop an efficient computational method to examine the approach to equilibrium with respect to different population genetic parameters. The results show that an alternative approach may be necessary to realistically model initial allele frequency distributions.

We build on our previous work on Markov-modulated Wright-Fisher models (Chapter 4) to construct a model of standing genetic variation. We explicitly consider the amount of variation present after an expected number of generations τ since the last sweep. This is done by extending the Wright-Fisher model with an explicit pre-adaptive epoch, during which the focal allele is not necessarily beneficial. This model is non-absorbing with respect to extinction and fixation, but undergoes an irreversible switch into a regime of positive selection after an average of τ generations. Our model naturally integrates over the distribution of allele frequencies acquired after a given time, relaxing the equilibrium distribution assumption. Note, however, that for a large enough τ, our model allows convergence on the equilibrium distribution of allele frequencies. For computational tractability we focus on a one-locus biallelic model. The biallelic assumption is obviously less desirable when considering phylogenetic substitution processes. Generalization of these approaches to the multiallelic case is therefore left for future work.

We compare three models of the substitution rate across species: the mutation-limited diffusion approximation [14], recurrent mutation (multiple-origin RM; [17]), and recurrent mutation with standing genetic variation (multiple-origin RM+SGV). The RM+SGV model includes the mutation-limited model as a special case, as it includes adaptation through both single and multiple origin selective sweeps. Unlike in much work on soft sweeps [86], we are primarily concerned with understanding the rate of adaptation, and not the characteristics of coalescent genealogies.

5.3 Methods

Here we describe our approach for calculating time to convergence (TC) on the equilibrium distribution of Wright-Fisher models. We propose an algebraically exact, but computationally demanding method, as well as an approximate method, which is much more efficient.

5.3.1 Rate of approach to equilibrium

Consider a finite non-absorbing, ergodic Markov chain with a transition probability matrix, P. While we focus our attention on the Wright-Fisher Markov model with bidirectional mutation and no absorbing states, the theory in this section applies equally well to any finite ergodic (irreducible) Markov chain. The Wright-Fisher model with bidirectional mutation and v > 0, u > 0 (forward and backward mutation rates, respectively) is believed to generally be ergodic, though we do not prove this here [50, 58, 5]. To compute how the probability distribution of allele counts is expected to change in a single generation given a starting distribution f0 we can compute:

f1 = f0P (5.3.1)

The expected distribution in τ generations is then:

$$f_\tau = f_0 P^\tau \qquad (5.3.2)$$

As τ tends to infinity, for any initial distribution f0, the product $f_0 P^\tau$ converges on the limiting distribution π:

$$\pi = \lim_{\tau \to \infty} f_0 P^\tau \qquad (5.3.3)$$

$$\phantom{\pi} = \pi P \qquad (5.3.4)$$

When the chain is ergodic and positive recurrent, the limiting distribution is also a stationary distribution. It is not entirely clear if Wright-Fisher models are always positive recurrent with arbitrarily strong selection, so we will focus on cases where selection is only moderately strong. In this case, π is a time-invariant stationary probability distribution with respect to P, and is the left eigenvector with an eigenvalue of 1. The stationary distribution of any ergodic Markov chain with a given transition probability matrix can be calculated by a variety of methods [87], and we describe a numerically stable method owing to Paige et al. [87] in the supplement (section 5.6.1).

While the ergodicity of the chain guarantees eventual convergence on π, we do not know how long we need to wait for the equilibrium distribution to be attained. To solve for the number of generations, G, it takes to reach π, we need to define a measure of distance, d, between π and the distribution at time τ, $f_0 P^\tau$. The total variation distance (TVD) is an appropriate distance measure between two probability distributions, and is often used in the study of convergence rates of Markov chains. It is defined as:

$$d(\pi, f_0 P^\tau) = \frac{1}{2} \sum_i \left| \pi_i - (f_0 P^\tau)_i \right| \qquad (5.3.5)$$

The TVD between two distributions has a maximum value of 1, and equals 0 when the distributions are identical. It can be intuitively interpreted as a percentage of absolute distance between distributions. By (5.3.3), d should decrease with increasing τ:

$$\lim_{\tau \to \infty} d(\pi, f_0 P^\tau) = 0 \qquad (5.3.6)$$

Note that we explicitly keep f0 in (5.3.5). A possible way to monitor approach to equilibrium is to iteratively calculate $f_0 P^\tau$ until TVD is sufficiently close to 0. We also note that the value of d decreases monotonically with every iteration. This gives an explicit solution without an approximation. While this approach can be optimized due to the sparsity of P of the Wright-Fisher model (at machine precision), the calculation is computationally expensive.

To claim that a chain has converged to the equilibrium distribution after G generations, we need TVD (5.3.5) to be close to 0. In practice, instead of defining a cutoff in terms of the value of d, we monitor its derivative. As $f_0 P^\tau$ approaches π, the derivative of d approaches 0. For a stopping criterion, we then require that the difference between successive d(τ) and d(τ + 1) is below a threshold ε close to machine epsilon. We found that the solution is sensitive to the exact value of ε chosen. However, we are primarily interested in the order of magnitude of TC, and use $\varepsilon = 1 \times 10^{-14}$ in the applications below.

An alternative method to find the time to convergence is to consider the spectral decomposition $P = E \Lambda E^{-1}$, where Λ is a diagonal matrix of eigenvalues, and E is a matrix whose columns are the corresponding eigenvectors. If the spectral decomposition of P is known, the exponentiation operation is cheap. The rate of approach to equilibrium is determined by the eigenvalues with the largest magnitude. We describe this method in detail in the supplement (section 5.6.2).

We now turn our attention to deriving formulations of the substitution rate between divergent species from the Wright-Fisher model.
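The iterative procedure of section 5.3.1 can be sketched in a few lines of Python. This is a minimal illustration, not the WFES2 implementation. The helper names (`wf_matrix`, `tvd`, `time_to_convergence`) are hypothetical. The form of ψi (genic selection followed by bidirectional mutation) is an assumption, since the parameterization is defined elsewhere in the thesis. A dense solver stands in for the sparse methods used in practice, and a small population size is used so that the example runs quickly.

```python
import math
import numpy as np

def wf_matrix(N, s, v, u):
    """(2N+1) x (2N+1) Wright-Fisher matrix; the form of psi_i is an assumption."""
    n = 2 * N
    x = np.arange(n + 1) / n
    xs = x * (1 + s) / (x * (1 + s) + (1 - x))   # genic selection on allele A
    psi = xs * (1 - u) + (1 - xs) * v            # v: forward, u: backward mutation
    j = np.arange(n + 1)
    comb = np.array([math.comb(n, k) for k in range(n + 1)], dtype=float)
    return comb * psi[:, None] ** j * (1 - psi[:, None]) ** (n - j)

def tvd(p, q):
    """Total variation distance, equation (5.3.5)."""
    return 0.5 * np.abs(p - q).sum()

def time_to_convergence(P, f0, eps=1e-14, max_iter=10_000_000):
    """Iterate f <- f P until the per-generation change in TVD falls below eps."""
    n = P.shape[0]
    A = P - np.eye(n)
    A[:, -1] = 1.0                    # normalization constraint: sum(pi) = 1
    b = np.zeros(n)
    b[-1] = 1.0
    pi = np.linalg.solve(A.T, b)      # stationary distribution (dense solve)
    f, d_prev = f0.copy(), tvd(pi, f0)
    for G in range(1, max_iter + 1):
        f = f @ P
        d = tvd(pi, f)
        if d_prev - d < eps:          # stopping criterion on the change in d
            return G, d
        d_prev = d
    raise RuntimeError("no convergence within max_iter")

N = 20                                # small N, for illustration only
P = wf_matrix(N, s=0.0, v=1e-2, u=1e-2)
f0 = np.zeros(2 * N + 1)
f0[0] = 1.0                           # monomorphic wildtype start
G, d_final = time_to_convergence(P, f0)
```

Monitoring the change in d rather than d itself follows the stopping rule described above; the returned G is therefore sensitive to the choice of `eps`.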

5.3.2 Rate of substitution in the Wright-Fisher model

The Wright-Fisher (WF) model of population genetics describes a single bi-allelic locus in a panmictic population of diploid organisms of fixed size N. We name a to be the ancestral allele, and A to be the derived allele. The model provides a transition probability matrix $P_{i \to j}$, denoting the probability of transition from i to j copies of allele A as a binomial distribution:

$$P_{i \to j} = \binom{2N}{j} \psi_i^{\,j} (1 - \psi_i)^{2N - j} \qquad (5.3.7)$$

where ψi is the binomial sampling probability, which may include the contribution of mutation, selection, and dominance. Equation 5.3.7 produces a discrete time Markov chain with 2N + 1 states. State 0 corresponds to the extinction of allele A, and state 2N to fixation of A. When computing the rate of substitution, we desire a general expression that does not assume infinite-sites. To accomplish this, we only consider the fixation state as absorbing - the model can enter the extinction state any number of times and later leave this state due to forward mutation [17]. This allows the time to absorption starting with zero mutant copies to give us the desired time to the next substitution, TSub, under the assumption of finite-sites. Kimura called this the time between fixations, which he derived under the assumption of infinite-sites [88]. Through this construction, it is also possible to fairly easily compute the entire probability distribution of the time to the next substitution, including its moments, using standard Markov chain methods (Chapter 6). Note that this solution does not make assumptions about the strength of selection, mutation, or dominance. The generality of this approach allows us to check for robustness of assumptions in other methods. These approaches are implemented in version 2 of our Wright-Fisher Exact Solver software, available at https://github.com/dekoning-lab/wfes2. Following Kimura, we define the expected rate of substitution, k, as:

$$k = \frac{1}{T_{Sub}} \qquad (5.3.8)$$
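The construction above, in which only fixation is absorbing and TSub is the mean time to absorption starting from zero copies, can be sketched as follows. This is a hedged illustration rather than the WFES2 code. The function names are hypothetical, the form of ψi is an assumed parameterization (genic selection followed by bidirectional mutation), and a dense linear solve is used in place of the sparse solvers appropriate for realistic N.

```python
import math
import numpy as np

def wf_matrix(N, s, v, u):
    """(2N+1) x (2N+1) Wright-Fisher matrix; the form of psi_i is an assumption."""
    n = 2 * N
    x = np.arange(n + 1) / n
    xs = x * (1 + s) / (x * (1 + s) + (1 - x))   # genic selection on allele A
    psi = xs * (1 - u) + (1 - xs) * v            # v: forward, u: backward mutation
    j = np.arange(n + 1)
    comb = np.array([math.comb(n, k) for k in range(n + 1)], dtype=float)
    return comb * psi[:, None] ** j * (1 - psi[:, None]) ** (n - j)

def substitution_rate(N, s, v, u):
    """Return (T_sub, k) with fixation (state 2N) as the only absorbing state."""
    n = 2 * N
    Q = wf_matrix(N, s, v, u)[:n, :n]    # transient states 0 .. 2N-1
    # Mean time to absorption from each transient state: (I - Q) t = 1
    t = np.linalg.solve(np.eye(n) - Q, np.ones(n))
    T_sub = t[0]                         # start with zero copies of A
    return T_sub, 1.0 / T_sub            # equation (5.3.8)

T_sub, k = substitution_rate(N=20, s=0.0, v=1e-5, u=1e-5)
```

For a neutral allele with weak mutation, the resulting k should be close to the per-site mutation rate v, which gives a quick sanity check on the construction.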

5.3.3 Modelling single-origin selective sweeps

Single origin (hard) selective sweeps occur when mutation in the population is so rare that the time it takes for a mutation to segregate to its ultimate fate is much shorter than the expected time to a new mutation in the population [41] (as also described by [10, 89, 88, 12]). In this case, the rate of adaptation is determined by the time it takes for a lucky mutation to arise that eventually sweeps to fixation. One common way that mutation-limited adaptation is modelled is by disallowing new mutations during the segregation of an initial allele in the population. Following Golding and Felsenstein[10], who first adapted Kimura’s infinite-sites rate of substitution for finite-sites with weak mutation, this yields:

$$k = 2Nv \, Pr_{fix} \qquad (5.3.9)$$

$$Pr_{fix} \approx \frac{1 - e^{-s}}{1 - e^{-2Ns}} \qquad (5.3.10)$$

where N is the number of reproducing individuals in the population, 2Nv copies of A are expected to be generated in each generation at each site, and $Pr_{fix}$ is the probability of fixation for a single copy of A on a background of a. Note this is the basis for the formulation of mutation-selection models of codon substitution [10, 11] and differs from Kimura's infinite-sites rate of substitution, which uses V, the per genome rate of mutation [88], instead of v, the per site rate of mutation. This is an important distinction because when modelling phylogenetic substitution processes we are almost always explicitly interested in computing quantities per site. In this case, the infinite-sites justification is inappropriate.
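Equations (5.3.9) and (5.3.10) are simple to evaluate directly. The sketch below uses hypothetical function names and handles the neutral case through its limit, Pr_fix → 1/(2N) as s → 0. `math.expm1` is used to avoid loss of precision for small s.

```python
import math

def pr_fix(N, s):
    """Fixation probability of a single copy, equation (5.3.10)."""
    if s == 0.0:
        return 1.0 / (2 * N)             # neutral limit of (5.3.10)
    # (1 - e^{-s}) / (1 - e^{-2Ns}), written with expm1 for accuracy
    return math.expm1(-s) / math.expm1(-2 * N * s)

def mutation_limited_rate(N, s, v):
    """Mutation-limited rate of substitution, equation (5.3.9)."""
    return 2 * N * v * pr_fix(N, s)

k_neutral = mutation_limited_rate(N=1000, s=0.0, v=2.5e-7)
k_positive = mutation_limited_rate(N=1000, s=0.005, v=2.5e-7)   # 2Ns = 10
```

In the neutral case the rate reduces to v, the per-site mutation rate, as expected from (5.3.9).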

5.3.4 Modelling multiple- and single-origin selective sweeps by recurrent mutation

We use the computational approach described above to compute TSub under a Wright-Fisher model with recurrent mutation and selection (RM; [17]). We set up the transition probability matrix for the WF model according to equation 5.3.7, and only allow the fixation state to be absorbing. The expected time to the next substitution is then calculated as the mean time to absorption using standard methods (equation 4.3.4). Note that in this formulation, sweeps can be hard or soft and the proportion of each will be determined by the population mutation rate [41].

5.3.5 Modelling multiple- and single-origin selective sweeps by recurrent mutation and standing genetic variation

To model standing genetic variation and recurrent mutation (RM+SGV), we require a distribution of allele frequencies prior to the onset of positive selection. The standard approach [81, 39, 68] has been to assume that the allele initially exists at the mutation-selection equilibrium (e.g., for a neutral allele; also see [90, 85]). Instead, we consider a Markov-modulated Wright-Fisher model (MMWF; Chapter 4) to explicitly describe an epoch before the onset of positive selection (the “pre-adaptive” phase), where the focal allele is neutral or mildly deleterious (s0 ≤ 0), and the model is non-absorbing with respect to extinction and fixation. After an average of τ generations, the model irreversibly switches into a model of positive selection (s1 > 0), where fixation is an absorbing state. The longer the model spends in the pre-adaptive epoch, the closer the initial allele frequency probability distribution in the adaptive model gets to the mutation-selection balance distribution under s0.

This gives two advantages over explicitly assuming mutation-selection balance. First, it allows substitution rates for any amount of time spent in the pre-adaptive epoch to be computed from a single row of the fundamental matrix of the MMWF. Second, it allows explicit consideration of biologically meaningful timescales on the distribution of standing genetic variation.

The full transition probability matrix for this MMWF model can be written as:

$$P = \begin{pmatrix} \left(1 - \frac{1}{\tau}\right) P_0 & \frac{1}{\tau} P_{01} \\ 0 & P_1 \end{pmatrix} \qquad (5.3.11)$$

where P0 represents a WF model for the pre-adaptive epoch, P1 is an absorbing WF model of the adaptive epoch, and P01 is the switching matrix between P0 and P1 (see Chapter 4; also Supplementary Methods). When we are interested in computing the substitution rate, P1 has only one absorbing state for fixation, as described above. The full transition probability matrix thus has (2N + 1) + 2N rows and (2N + 1) + 2N columns. We can find the fundamental matrix of this Markov chain (equation 2.3.2) using standard methods (Chapter 2). The first 2N + 1 columns of the fundamental matrix correspond to the expected number of generations spent in the pre-adaptive epoch (which by construction sum to τ), and the last 2N columns correspond to the number of generations spent in the adaptive epoch prior to eventual fixation.

It should be noted that the main computational limitation of this approach is the dependence of the state space on the population size N. Larger transition probability matrices require more computational resources to calculate the solutions to equation 2.3.12. Through careful optimization, we were able to achieve good performance for realistic population sizes.
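The block structure of (5.3.11) and the use of a single row of the fundamental matrix can be illustrated with a small dense example. This is a sketch under stated assumptions, not the WFES2 implementation: the helper names are hypothetical, the form of ψi is an assumed parameterization, and the switching matrix P01 is assumed here to apply one generation of adaptive-epoch sampling to the current allele count (its actual definition is given in Chapter 4).

```python
import math
import numpy as np

def wf_matrix(N, s, v, u):
    """(2N+1) x (2N+1) Wright-Fisher matrix; the form of psi_i is an assumption."""
    n = 2 * N
    x = np.arange(n + 1) / n
    xs = x * (1 + s) / (x * (1 + s) + (1 - x))   # genic selection on allele A
    psi = xs * (1 - u) + (1 - xs) * v            # v: forward, u: backward mutation
    j = np.arange(n + 1)
    comb = np.array([math.comb(n, k) for k in range(n + 1)], dtype=float)
    return comb * psi[:, None] ** j * (1 - psi[:, None]) ** (n - j)

def mmwf_sojourn(N, s0, s1, v, u, tau):
    """Expected generations per state, starting from 0 copies in the pre-adaptive epoch."""
    n = 2 * N
    P0 = wf_matrix(N, s0, v, u)          # pre-adaptive epoch, non-absorbing
    P1_full = wf_matrix(N, s1, v, u)
    P1 = P1_full[:n, :n]                 # adaptive epoch, fixation state removed
    P01 = P1_full[:, :n]                 # ASSUMED form of the switching matrix
    Q = np.block([[(1 - 1 / tau) * P0, (1 / tau) * P01],
                  [np.zeros((n, n + 1)), P1]])
    e0 = np.zeros(2 * n + 1)
    e0[0] = 1.0
    # Row 0 of the fundamental matrix (I - Q)^{-1}, via one transposed solve
    row = np.linalg.solve((np.eye(2 * n + 1) - Q).T, e0)
    return row[:n + 1], row[n + 1:]      # pre-adaptive and adaptive sojourn times

tau = 1000.0
pre, adaptive = mmwf_sojourn(N=10, s0=0.0, s1=0.1, v=1e-3, u=1e-3, tau=tau)
T_sub = pre.sum() + adaptive.sum()       # expected time to the next substitution
```

Solving one transposed system returns only the row that is needed, which reflects the observation above that substitution rates can be computed from a single row of the fundamental matrix. The pre-adaptive entries sum to τ regardless of the form chosen for P01.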

5.3.6 Simulations

The calculations above were validated against Wright-Fisher simulations (see Supplementary Methods). Briefly, the RM scenario was simulated by starting with 0 copies of A, and repeatedly sampling the next generation count i from equation 5.3.7, until i = 2N. For RM+SGV simulations, we simulated a neutral allele for an average of τ generations, without absorption. The number of generations spent in the pre-adaptive epoch is a random variable with mean τ, so at every generation we drew a Bernoulli random variable with success probability 1/τ, in order to decide if we should switch into the adaptive epoch. Once in the adaptive epoch, we simulated the allele frequency trajectories until absorption (i = 2N), exactly as for RM. We additionally kept track of how many times an allele trajectory went to state 0 for both RM and RM+SGV simulations.
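The simulation scheme described above can be sketched as follows. The function names are hypothetical, the form of ψi is an assumed parameterization, and the handling of the generation in which the switch occurs (one sampling step under the adaptive-epoch parameters) is an assumption made for the sake of a complete example. The counter of visits to state 0 is omitted for brevity.

```python
import numpy as np

def psi(i, N, s, v, u):
    """Binomial sampling probability for i copies of A (assumed parameterization)."""
    x = i / (2 * N)
    xs = x * (1 + s) / (x * (1 + s) + (1 - x))   # genic selection
    return xs * (1 - u) + (1 - xs) * v           # bidirectional mutation

def simulate_rm(N, s, v, u, rng, i=0):
    """Generations until fixation under recurrent mutation, starting from i copies."""
    n, t = 2 * N, 0
    while True:
        i = rng.binomial(n, psi(i, N, s, v, u))  # one generation of (5.3.7)
        t += 1
        if i == n:
            return t

def simulate_rm_sgv(N, s0, s1, v, u, tau, rng):
    """Pre-adaptive epoch of random length (mean tau), then adaptive epoch to fixation."""
    n, i, t = 2 * N, 0, 0
    while rng.random() >= 1.0 / tau:             # Bernoulli(1/tau) switch each generation
        i = rng.binomial(n, psi(i, N, s0, v, u))
        t += 1
    return t + simulate_rm(N, s1, v, u, rng, i=i)

rng = np.random.default_rng(1)
t_rm = [simulate_rm(10, 0.1, 0.01, 0.01, rng) for _ in range(50)]
t_sgv = [simulate_rm_sgv(10, 0.0, 0.1, 0.01, 0.01, 50.0, rng) for _ in range(50)]
```

Averaging such replicate times and taking the reciprocal gives a simulated estimate of the substitution rate that can be compared against the matrix calculation.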

5.4 Results

5.4.1 Approach to equilibrium

[Figure 5.1 plot. Y-axis: total variation distance (0.0 to 1.0). X-axis: time in generations ($10^0$ to $10^7$). One curve for each value of 2Ns from -10.0 to 10.0.]

Figure 5.1: Time to convergence starting in a monomorphic population, for different selection coefficients. The deleterious alleles converge fastest, the neutral alleles (green, dashed) take longest to converge. X-axis on a logarithmic scale. Low mutation rate ($\theta = \frac{1}{1000}$), medium population size (N = 1000), $\varepsilon = 1 \times 10^{-14}$.

We first consider the expected time in generations it takes for a single Wright-Fisher population to approach its equilibrium distribution as a function of selection, following the last selective sweep at the position of interest. As expected, Figure 5.1 shows that total variation distance (TVD) from equilibrium decreases over time since the last sweep (when starting in a monomorphic wildtype population with mutant allele frequency 0). For positively selected variants (red), TVD decreases from near the theoretical maximum TVD of 1, down to about 0 within $5 \times 10^6$ generations. Since the mean time to the first mutation in this example is $1/v = 4 \times 10^6$ generations, this means that convergence of weakly positively selected variants may still require an average of about 1 million generations after the expected appearance of the first mutation (note the log scale). The neutral variants (green, dashed) take the longest to converge, since TVD has not reached zero even by $10^7$ generations. In this case, this corresponds to 40 to 50 million generations to convergence, after the expected appearance of the first mutation. Deleterious variants (blue) start close to their equilibrium distribution (mutation-selection balance, see Fig. 5.8), as indicated by small starting TVD.

How these results are interpreted depends in part on how small we think TVD should be for a population to be sufficiently converged. In subsequent analyses, we call the population converged when the change in TVD per generation falls below $\varepsilon = 1 \times 10^{-14}$. Changing this arbitrary value will produce different values of the time to convergence (TC). However, the order of magnitude of the time required to reach equilibrium does not change for any value of TVD that may be called small. For neutral variants, between 5 and 50 million generations are required on average for the allele frequency distribution to resemble the equilibrium distribution, π (Fig. 5.8).

Figure 5.2 clarifies the effect of selection and shows the required TC given the decision criterion of $\varepsilon = 1 \times 10^{-14}$. Here we clearly observe that the maximum TC is for neutral variants. The fastest convergence is observed for strong negative selection. Increasing positive selection is also seen to decrease the required TC, but since the starting monomorphic population is far from π (Fig. 5.1), convergence still takes a large amount of time.

[Figure 5.2 plot. Y-axis: time to convergence ($10^2$ to $10^7$). X-axis: 2Ns (-200 to 200).]

Figure 5.2: Time to convergence starting in a monomorphic population, for different selection coefficients. The figure shows the same data as figure 5.1, with time on the Y-axis, for a wider set of selection coefficients. The deleterious alleles converge fastest, the neutral alleles take longest to converge. Low mutation rate ($\theta = \frac{1}{1000}$), medium population size (N = 1000), $\varepsilon = 1 \times 10^{-14}$.

We also compared the exact and spectral method for calculating the approach to equilibrium, to ensure that they produce compatible results (Fig. 5.9). The two solution methods give different values of TVD (since they measure divergence from π and πE respectively), but both converge at d ≈ 0 at similar values of G.

Figure 5.3A shows TC for different mutation rates, assuming a constant population size. As expected, TC decreases with increasing mutation rate, $G \approx \frac{c}{v}$ (dashed line). The stopping criterion ε determines the value of the constant c, which is approximately 6.8 in Fig. 5.3. We do not pursue an exact expression here, but simply aim to show the strong effect of mutation. For the maximum mutation rate, which corresponds to 4Nv = θ = 0.1, the TC value was $3.8 \times 10^5$, which is still long.

[Figure 5.3 plot, panels A and B. Y-axis: time to convergence ($10^6$ to $10^8$). X-axis of panel A: θ (0 to 2.5e-2). X-axis of panel B: N (0 to 50000).]

Figure 5.3: Time to convergence for (A) variable mutation rate (neutral variant, 2Ns = 0; medium population size, N = 1000; $\varepsilon = 1 \times 10^{-14}$) and (B) variable population size (neutral variant; small mutation rate, $v = 2.5 \times 10^{-7}$; $\varepsilon = 1 \times 10^{-14}$). Red line shows the time to convergence. The dashed line shows an approximation $\frac{c}{v}$, where c is an arbitrary constant. Time to convergence decreases with increasing mutation rate.

Interestingly, if we vary N, while keeping v constant (Fig. 5.3B), TC appears to be relatively unchanged by different values of the population mutation rate parameter, θ. Numerically, the leading non-unit eigenvalue of the model is λ1 ≈ 1 − 2µ, which shows that the raw mutation rate is the chief determinant of the approach to equilibrium. One aspect for possible future exploration would be to consider the rate of approach to equilibrium under time-variable population size (as suggested by Dr. Jeffrey Thorne).
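The observation that the leading non-unit eigenvalue is close to 1 − 2µ can be checked numerically on a small neutral model. The sketch below assumes that µ denotes a common forward and backward mutation rate (v = u = µ) and that mutation enters ψi linearly; both are assumptions about the parameterization, and the function names are hypothetical. A dense eigensolver is used for simplicity in place of a sparse one.

```python
import math
import numpy as np

def wf_matrix(N, s, v, u):
    """(2N+1) x (2N+1) Wright-Fisher matrix; the form of psi_i is an assumption."""
    n = 2 * N
    x = np.arange(n + 1) / n
    xs = x * (1 + s) / (x * (1 + s) + (1 - x))   # genic selection on allele A
    psi = xs * (1 - u) + (1 - xs) * v            # v: forward, u: backward mutation
    j = np.arange(n + 1)
    comb = np.array([math.comb(n, k) for k in range(n + 1)], dtype=float)
    return comb * psi[:, None] ** j * (1 - psi[:, None]) ** (n - j)

def leading_eigenvalues(P, k=2):
    """The k largest eigenvalues of P (real parts), in decreasing order."""
    lam = np.linalg.eigvals(P).real
    return np.sort(lam)[::-1][:k]

mu = 1e-3
lam0, lam1 = leading_eigenvalues(wf_matrix(N=20, s=0.0, v=mu, u=mu))
# Under these assumptions lam0 is the unit eigenvalue and lam1 is close to 1 - 2*mu,
# so deviations from equilibrium decay roughly as (1 - 2*mu)**t.
```

Because the second eigenvalue does not depend on N in this neutral case, the sketch is consistent with TC being largely insensitive to population size at fixed mutation rate.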

Summary

For deleterious variants, a relatively small amount of time is required to approach equilibrium. This implies that mutation-selection balance under negative selection should be relatively common in nature. The number of generations required to reach equilibrium under neutrality, however, is comparatively very large. This implies that mutation-drift balance is expected to be hard to attain in nature, since it would require that population parameters remain unchanged for a very long period of time (millions to tens of millions of generations). Interestingly, under positive selection, the distance from a monomorphic population to an equilibrium population is large. This means that TC is also large under positive selection. With increasing magnitude of 2Ns, the required time is reduced, but it remains significantly longer than for deleterious alleles. These results suggest that the assumption of an equilibrium allele frequency distribution may be too restrictive for modelling real biological populations.

5.4.2 Finite-sites substitution rate with bidirectional mutation, selection, and SGV

We implemented three substitution rate computations for comparison: 1) Golding and Felsenstein's finite-sites, weak mutation rate of substitution assuming mutation-limited evolution and hard sweeps ($k = 2Nv \, Pr_{fix}$); 2) direct computation of the finite-sites rate of substitution, as one over the expected time to the next substitution, from a Wright-Fisher Markov model with recurrent bidirectional mutation (RM; allowing hard or soft sweeps); and 3) direct computation of the finite-sites rate of substitution as in (2), but with standing genetic variation determined by neutral evolution for an expected number of generations, τ (RM+SGV; allowing hard or soft sweeps).

Figure 5.4 shows substitution rate comparisons for a range of selection coefficients and mutation rates. When SGV was considered, we initially used a large value of τ ($10^9$ generations) so that the initial allele count probability distribution at the onset of positive selection reflected mutation-drift balance. As expected, adaptation under recurrent mutation and standing genetic variation (RM+SGV) is faster than when standing genetic variation is ignored (cf. mutation-limited and RM models). This effect of SGV is consistent with expectations and is compatible with predictions from previous work [68]. Under these conditions, in a human-like population (N = 10,000, θ ≈ 0.001) with 2Ns1 = 10, the rate of RM+SGV substitution is 1.98 times greater than the mutation-limited rate of substitution, and is 1.97 times greater than the model that includes soft sweeps under recurrent mutation only (RM).

As the amount of time spent accumulating neutral variations is decreased, the rate of substitution also declines (Fig. 5.5). When either mutation rates or the time spent accumulating neutral variation is large, the rate of substitution can be as much as 2× higher with standing genetic variation compared to without. These results suggest that SGV can have a significant effect on both the expected time to the next substitution, and, as a result, on the rate of substitution. As we discuss later, the magnitude of this effect will be determined by τ, which is in turn determined by the timescale over which selection changes.

[Figure 5.4 plot, three panels: 2Ns1 = -2, 2Ns1 = 0, 2Ns1 = +2. Y-axis: substitution rate, generations^-1 (0e+00 to 3e-05). X-axis: θ = 4Nv (0.0 to 0.5). Legend: Mutation-limited; Recurrent mutation (RM); Recurrent mutation + standing genetic variation (RM+SGV).]

Figure 5.4: Standing genetic variation increases the corresponding rate for the next substitution. A very long time spent accumulating neutral variations is assumed ($\tau = 10^9$) so that mutation-drift equilibrium is implied at the onset of positive (or negative) selection. The mutation-limited rate of evolution predicts a continual increase in the rate of substitution for increasing θ. The rate of increase in the substitution rate slows down with increasing θ under RM+SGV and RM. With θ > 0.3, the RM+SGV substitution rate even starts to slow down. RM - recurrent mutation, SGV - standing genetic variation. Y-axis is on a log scale. N = 10000, 2Ns0 = 0, $\tau = 10^9$.

[Figure 5.5 plot. Y-axis: τ (1e+04 to 1e+10). X-axis: θ = 4Nµ (1e−05 to 1e−01). Color scale: ratio of SGV+RM to RM substitution rates (1.25 to 2.00).]

Figure 5.5: Ratio of rates of substitution, caused by varying the extent of standing genetic variation and mutation. When mutation rates are low and the time spent accumulating neutral variation is small, little standing genetic variation is generated and the rates of substitution are relatively unchanged compared to a model without SGV. When the time spent accumulating SGV and/or the mutation rates are large, SGV can increase substitution rates by up to 2 times. 2Ns0 = 0, $N = 10^4$ so that different values of θ correspond to different mutation rates (assuming forward and backward mutation rates are equal).

It is notable that with θ > 0.3 (Fig. 5.4), the RM+SGV substitution rate starts to actually decrease with increasing θ. This was also found under the finite-sites model with bidirectional RM only, but for larger values of θ [17]. This is most likely explained by an increasing importance of backward mutation as forward and backward mutation rates are increased together. This observation may be important biologically, as it suggests a natural limit to the utility of increasing mutation rates in populations that need to frequently adapt to changing environments. However, it should be noted that while values of θ have been estimated to be as high as 1.0 in HIV-1 [36, 79, 91], and are estimated to have a median close to 0.1 in RNA viruses [92], it is currently unknown whether such large values of θ are biologically relevant to many natural populations. Results with such large values of θ should thus be interpreted cautiously. One exception to this, where large values of θ are clearly important, is for microsatellites and other fast mutation types, which may often have corresponding values of θ that far exceed these numbers (see [85] and references therein).

5.4.3 Validation by simulation

We validated the substitution rate of our models with Monte-Carlo simulation (see supplement). The simulated rates of substitution agree well with the calculated substitution rates (Figure 5.6). RM and RM+SGV models agree with simulation averages over a wide parameter range. Full results are presented in the supplement.

5.4.4 Sojourn times prior to absorption

Since we explicitly solve for the fundamental matrix, we can investigate sojourn times prior to absorption for different parameter values. These are the expected number of generations spent in each allele count state prior to absorption. Figure 5.7 shows the expected number of generations spent in pre-adaptive and adaptive epochs at each copy number. For small values of τ, the majority of the time is spent at low allele frequencies in the pre-adaptive epoch. Increases to the mutation rate increase the amount of time spent at higher allele frequencies.

[Figure 5.6 plot. Y-axis: substitution rate, generation^-1 (0e+0 to 4e-4). X-axis: γ = 2Ns1 (-2.5 to 10.0). Legend: Mutation-limited; Recurrent mutation (RM); Recurrent mutation + standing genetic variation (RM+SGV).]

Figure 5.6: Comparison of simulated versus calculated substitution rates under different selection strengths. Hard (mutation-limited) sweeps in red, hard or soft sweeps in blue with RM only, hard or soft sweeps in green with RM+SGV. Lines show calculated rates from WFES2, points show averages of WF simulations (10,000 replicates per point). Pre-adaptive mode simulated with neutral allele (s0 = 0), N = 100, θ = 0.1, $\tau = 10^5$.

For large values of τ, the number of generations spent in the pre-adaptive epoch qualitatively resembles the shape of the mutation-drift equilibrium distribution (Fig. 5.8). This is due to the fact that the sojourn time function is essentially a rescaled version of the probability distribution, where the measure is in terms of expected number of generations spent, and not the probability. Intuitively, with increased τ, the model approaches the mutation-selection equilibrium, and the expected time spent in each state reflects that.

[Figure 5.7 plot, panel columns: θ = 0.001, θ = 0.01, θ = 0.1; panel rows: τ = 1000, τ = 1e+10. Y-axis: expected number of generations (log scale). Legend: Pre-Adaptive; Adaptive.]

Figure 5.7: Number of generations spent in each epoch prior to fixation in the terminal epoch. Horizontal gray line at t = 0. Red curve shows the expected number of generations at each copy number in the pre-adaptive epoch. Blue line shows the expected number of generations at each copy number in the adaptive epoch. With small τ, the majority of time in the pre-adaptive epoch is spent at low allele frequencies (predominantly zero). With large τ, the number of generations spent at each copy number resembles the equilibrium distribution.

5.5 Conclusions

We developed an extension to the Wright-Fisher model in order to describe the rate of substitution resulting from hard or soft sweeps, including with arbitrary amounts of standing genetic variation. Our model explicitly accounts for the strength of selection and the amount of available segregating variation as a function of time. Using our approach, we show that given a sufficient amount of standing genetic variation, the expected time to the next substitution can be up to 2× shorter than when a population begins in a monomorphic state. This acceleration is independent of the strength of positive selection when the mutation rate is small. With larger mutation rates, stronger positive selection leads to greater rate accelerations.

Our results also suggest that to reach mutation-selection equilibrium, an exceedingly long amount of time might be necessary, except for deleterious variants which start much closer to their equilibrium distribution than do neutral or adaptive variants. We therefore argue that the widely used assumption of mutation-selection balance might not be a good model for natural populations.

We expect that in natural populations, the amount of time spent accumulating neutral variations is largely determined by the timescale by which selection coefficients change. If selection changes relatively rapidly, neutral standing genetic variation may not have much time to accumulate. If selection changes very rapidly, so that fixation is unlikely before the environment changes again, the rate of substitution itself may become ineffective for quantifying the rate of evolution, and the rate of change in mean fitness of the population may serve as a more appropriate metric (e.g., [93]; also see Maynard Smith [94] for a unique perspective). If selection changes relatively slowly, or episodically (e.g., [95]), it seems very likely that the presumably long amount of time since the last selective sweep will allow substantial SGV to accumulate, which should ideally be accounted for in models that consider time-heterogeneity in selection and are applied to phylogenetic sequence comparisons.

Our modelling approach allows this in principle. However, to be made useful for inference it would need to be generalized (or approximated) to allow interference among different types in the multi-allelic case. In such an application, significant pre-computation would likely be required, as is fairly common in population genetics, even when using diffusion approximations [61, 62]. These natural directions for the work in this chapter are of ongoing interest in our lab.

5.5.1 Data availability

All the data necessary for this investigation is provided within the paper. The source code is available at https://github.com/dekoning-lab/wfes2.

5.6 Supplement

5.6.1 Solving for equilibrium

To solve for the stationary distribution of a finite, ergodic Markov chain, we use the method described in [87] (see also [96]). The stationary distribution of a Markov chain is defined as a vector π, such that πP = π. For a model with n states, we can construct an n × n matrix Π, with π in each row. The matrix Π is a notational convenience which allows us to factor matrix-vector products. We can express this in the form:

ΠP = Π (5.6.1)

Π(P − In) = 0n (5.6.2)

where $I_n$ and $0_n$ are the n × n identity and zero matrices, respectively. Note that if we used the πP = π form above, as opposed to Π, we would not be able to factor out π. Since π is a probability distribution, we have $\sum_i \pi_i = 1$, which we can enforce by setting the last columns of $(P - I_n)$ and $0_n$ to $e_n = (1, 1, \ldots, 1)^T$. We use the notation r(A) to denote the operation of setting the last column of a matrix A to $e_n$. Then,

$$\Pi \, r(P - I_n) = r(0_n) \qquad (5.6.3)$$

$$r(P - I_n)^T \, \Pi^T = r(0_n)^T \qquad (5.6.4)$$

Since all the rows of Π are identical, we only need to find one of them. Denoting $A_{\cdot,x}$ as the xth column of A:

$$r(P - I_n)^T \, (\Pi^T)_{\cdot,x} = (r(0_n)^T)_{\cdot,x} \qquad (5.6.5)$$

The equation above has the form Ax = b, and can be solved by standard linear system solvers. In this work, we use INTEL MKL PARDISO[18] for this task.
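As an illustration of equation (5.6.5), the following sketch solves for the stationary distribution of a small chain. It substitutes NumPy's dense solver for Intel MKL PARDISO, so it is suitable only for small matrices. The function name and the two-state example matrix are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve r(P - I)^T pi^T = (r(0)^T)_{.,x} for pi, following (5.6.5)."""
    n = P.shape[0]
    A = P - np.eye(n)
    A[:, -1] = 1.0            # r(P - I_n): last column replaced by ones
    b = np.zeros(n)
    b[-1] = 1.0               # any column of r(0_n)^T: zeros with a final 1
    return np.linalg.solve(A.T, b)

# Two-state example, for which pi = (5/6, 1/6) can be verified by hand
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
```

Replacing the last column encodes the normalization constraint and removes the rank deficiency of P − I, so the system has a unique solution for an ergodic chain.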

[Figure 5.8 plot. Y-axis: probability (log scale, $10^0$ down to $10^{-8}$). X-axis: allele frequency (0.0 to 1.0). Legend: 2Ns = -10, 0, 10.]

Figure 5.8: Example equilibrium distribution for different values of the population-scaled selection coefficient, 2Ns. The neutral distribution (black) is bimodal; for positive selection (red) the distribution is a right-heavy distribution; and for negative selection (blue), the majority of the probability mass is at low allele frequencies.

5.6.2 Approach to equilibrium via spectral decomposition

We start by writing that, for some large value of $\tau = G$, the Markov process has converged to $\pi$:

$$f_0 P^{G} = \pi \tag{5.6.6}$$

Then from (5.3.3) and the spectral form of $P$, $P = E \Lambda E^{-1}$, we have:

$$f_0 P^{G} = \pi P \tag{5.6.7}$$

$$f_0 E \Lambda^{G} E^{-1} = \pi E \Lambda E^{-1} \tag{5.6.8}$$

$$f_0 E \Lambda^{G} = \pi E \Lambda \tag{5.6.9}$$

$$f_0 E \Lambda^{G-1} = \pi E \tag{5.6.10}$$

The matrix $E^{-1}$ is not provided by the standard eigensolvers and needs to be calculated separately. To avoid calculating the large matrix inverse, we cancel it from the equation above. Since $f_0 P^{\tau}$ approaches $\pi$, so does $f_0 E \Lambda^{\tau-1}$ approach $\pi E$:

$$\lim_{\tau \to \infty} d\left(f_0 E \Lambda^{\tau-1}, \pi E\right) = 0 \tag{5.6.11}$$

Equation (5.6.11) is essentially a restatement of (5.3.3), but in terms of the eigenbasis of $P$. This provides an alternative way of solving for the value of $G$ at which the total variation distance (TVD) is close to 0. The advantage of this method is that we do not require all the eigenvalues of $P$. For any stochastic matrix, the largest eigenvalue is 1 and all others have modulus at most 1, so the eigenvalues can be ordered by decreasing modulus, starting from 1. In relation to the Wright-Fisher model, the results of Cannings [97] provide insight into the eigenvalues, as reviewed in [5]. Efficient algorithms exist for numerically finding the leading eigenvalues of sparse matrices (Spectra [98]), which we use here. If we restrict ourselves to the leading $l$ eigenvalues, the diagonal matrix exponentiation $\Lambda^{\tau-1}$ in (5.6.11) becomes a cheap $O(l)$ operation. In addition, we do not need to proceed in an iterative fashion, since any $\Lambda^{x}$ can be calculated quickly.
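As a small worked illustration of why only the leading eigenvalues matter, consider a generic two-state chain. This example is not taken from our Wright-Fisher results; the switching probabilities $a$ and $b$ are hypothetical, with $0 < a + b < 2$, and $d$ is taken to be the total variation distance (half the $L_1$ distance).

```latex
\begin{aligned}
P &= \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}, \qquad
\pi = \left( \frac{b}{a+b},\ \frac{a}{a+b} \right), \qquad
\lambda_1 = 1, \quad \lambda_2 = 1-a-b, \\
f_0 P^{\tau} &= \pi + \lambda_2^{\tau}\,(f_0 - \pi), \\
d\!\left(f_0 P^{\tau}, \pi\right) &= \frac{a}{a+b}\,\lvert \lambda_2 \rvert^{\tau}
\quad \text{for } f_0 = (1, 0).
\end{aligned}
```

In this toy case the distance from equilibrium is controlled entirely by the second eigenvalue, and the value at any $\tau$ follows from a single scalar power rather than $\tau$ successive matrix products.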

[Figure 5.9: TVD (y-axis, 0.0 to 0.5) versus time in generations (x-axis, log scale, $10^1$ to $10^8$), for the iterative and spectral methods.]

Figure 5.9: Approach to equilibrium starting in a monomorphic population. Red line (iterative) shows the total variation distance of the allele frequency distribution at generation g from the equilibrium distribution. Blue line (spectral) shows the TVD of the spectral representation from $\pi E$. While the measures do not agree, they both converge to zero at the same time. The dashed line shows the convergence point (28.9 million generations). Neutral variant ($2Ns = 0$), low mutation rate ($\theta = 1/1000$), medium population size ($N = 1000$), $\varepsilon = 1 \times 10^{-14}$.

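5.7 Wright-Fisher simulation code

```c
/* runif(a, b) and rbinom(n, p) draw uniform and binomial random numbers.
 * They are not defined in this listing and are assumed to be provided
 * by the surrounding simulation environment. */

/* Expected frequency of the focal allele in the next generation, after
 * selection and mutation, given i copies among 2N in a diploid population. */
double psi_diploid(
    const long i, const long N,
    const double s, const double h,
    const double u, const double v)
{
    long j = (2 * N) - i;
    double w_11 = 1 + s;
    double w_12 = 1 + (s * h);
    double w_22 = 1;
    double a = w_11 * i * i;
    double b = w_12 * i * j;
    double c = w_22 * j * j;
    double w_bar = a + (2 * b) + c;
    return (((a + b) * (1 - u)) + ((b + c) * v)) / w_bar;
}

/* Neutral pre-adaptation epoch, starting from 0 copies. Each generation the
 * epoch continues with probability 1 - 1/tau. Returns the final allele count. */
long sim_pre_adapt(long N, double mu, long tau)
{
    long x = 0;
    double l = 1 / (double)tau;
    for (long t = 0; runif(0, 1) < (1 - l); t++) {
        double p = psi_diploid(x, N, 0, 0.5, mu, mu);
        x = rbinom(2 * N, p);
    }
    return x;
}

/* Simulate from i copies until the allele count reaches 2N (fixation).
 * Returns the generation counter at fixation (the counter starts at 1). */
long sim_fix(long i, long N, double s, double mu)
{
    long t;
    for (t = 1; i < 2 * N; t++) {
        double p = psi_diploid(i, N, s, 0.5, mu, mu);
        i = rbinom(2 * N, p);
    }
    return t;
}

/* Standing genetic variation + recurrent mutation (RM+SGV). */
long sim_sgv_rm(long N, double s, double mu, long tau)
{
    return sim_fix(sim_pre_adapt(N, mu, tau), N, s, mu);
}

/* Recurrent mutation only (RM), starting from a monomorphic population. */
long sim_rm(long N, double s, double mu)
{
    return sim_fix(0, N, s, mu);
}
```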

5.8 Validation by simulation

Figure 5.11 shows the means and standard deviation of the simulated time to the next substitution. For smaller selection coefficients, the standard deviation increases, since deleterious alleles are less likely to fix directly. In Figure 5.6, the population-scaled mutation rate θ = 0.1 was specifically picked to demonstrate the difference between models. Note that the single-origin diffusion approximation (Figure 5.6 red) does not agree well with either of the simulated results.

[Figure 5.10: histogram titled "Distribution of number of trials in RM vs RM+SGV"; count (y-axis) versus trials (x-axis, 0 to 1500), with distribution means marked at 116.37 and 126.33. Legend: recurrent mutation; standing genetic variation + recurrent mutation.]

Figure 5.10: Simulated number of extinction trials before fixation in RM and RM+SGV. Recurrent mutation in red, recurrent mutation and standing genetic variation in blue. Distribution means are shown as vertical lines, with their values below the histogram. On average, standing genetic variation allows for a larger number of direct fixations. $N = 100$, $2Ns_1 = 1$, $\theta = 4 \times 10^{-4}$, $\tau = 10^4$, $n = 2 \times 10^4$ replicate simulations.

We also simulated fixation trajectories from the RM and RM+SGV models (Figure 5.10). If standing genetic variation is modeled, there is a smaller number of extinction events before a final fixation. In the simulation set shown in Figure 5.10, 9.04% of trajectories go directly to fixation in RM+SGV simulations, while only 0.76% go directly to fixation in the RM model. This suggests that standing genetic variation allows a larger number of trajectories to go directly to fixation, avoiding a slowdown due to multiple extinctions.

5.9 Supplementary figures

[Figure 5.11: mean simulated fixation time (y-axis, 0 to 2e5) versus 2Ns (x-axis, -2.5 to 10.0), for the RM and RM+SGV models.]

Figure 5.11: Means and standard deviations of simulated times between substitutions. $s_0 = 0$, $N = 10^2$, $\theta = 0.1$, $\tau = 10^5$.

[Figure 5.12: two panels, "RM+SGV vs mutation-limited" and "RM vs mutation-limited"; 2Ns (y-axis, -10 to 10) versus θ = 4Nv (x-axis, 1e-3 to 1e-1); colour scale: substitution rate ratio (1/2, 1, 2).]

Figure 5.12: Rate deviation between evolutionary rates for a large parameter grid. $s_0 = 0$, $N = 10^4$, $\tau = 10^9$.

Chapter 6

WFES2: New models, computations, and improved performance in the Wright-Fisher Exact Solver

Contributions

IK, BD, and JdK developed the methods, IK implemented the methods, IK and JdK analyzed the data and wrote the paper. This chapter is under preparation for submission.

6.1 Abstract

Here we describe a substantially expanded and improved version of our versatile population genetics toolbox, Wright-Fisher Exact Solver [1]. WFES2 has been rewritten from the ground up for transparency and high performance, and it implements a wide range of more flexible models and computations. WFES2 rapidly and directly computes properties of the Wright-Fisher Markov model of population genetics without simulation or diffusion theory. Models in the new version include selection, bidirectional recurrent mutation, dominance, time-variable population size, and time-heterogeneity in any parameters of the model (e.g., selection, mutation rates, etc.). This new version performs a variety of novel and standard computations, and now supports rapid computation of entire probability distributions, including for the time-evolution of allele frequency spectra under both deterministic and stochastic model switching histories, conditional and unconditional sojourn times in any model, and for the probability distributions describing times to fixation, extinction, absorption, and for the ‘finite-sites’ rate of phylogenetic substitution (i.e., for the expected time to the next substitution). Expected times, their variances, and probabilities of fixation, extinction, and absorption are computed, together with our previously described method for computing arbitrarily high moments of the probability distribution describing allele age 3 (where the first three moments are implemented). All calculations can be integrated over an initial probability distribution, and arbitrarily high moments of the probability distribution describing the time to the next substitution can be rapidly computed by solving a series of linear systems of equations.

Notably, all calculations are implemented to be algebraically exact, given the underlying model. As a result, WFES2 makes none of the standard simplifying assumptions that are widely deployed throughout theoretical population genetics, including infinite-sites, weak mutation, weak selection, or small changes in allele count probabilities per generation. This feature is of particular interest, as it allows rapid, direct interrogation of model properties in parameter ranges that were previously relatively inaccessible, except by comparatively tedious simulations. By focusing on Markov models, WFES2 avoids problems with handling absorbing boundaries that arise in diffusion theory methods, and it enables the rich theory of Markov processes to be easily applied to generate stable and accurate solutions for a variety of problems that are notoriously difficult using diffusion theory. WFES2 exploits sparsity of Wright-Fisher transition matrices (at machine precision), and employs state of the art parallel high-performance linear algebra routines, allowing consideration of population sizes up to 100,000 on most modern workstations. A variety of approximations can be optionally used, which increase the applicability of the method. These include small population approximations to larger populations, truncation of the rows of the transition probability matrix based on a probability mass threshold, and others.

Full source code is publicly available at https://github.com/dekoning-lab/wfes2.

6.2 Introduction

The Wright-Fisher Markov model of population genetics describes the dynamics of allele count changes in a finite population, which result from the interplay of random genetic drift and directional evolutionary forces [51, 50]. The model has seen extensive use in both evolutionary and population genetics, and its analysis continues to provide an important theoretical foundation to both fields. Despite its apparent simplicity, the model is not trivial to analyze, and several frameworks have been used for this purpose [99] including continuous-time diffusion approximations (e.g., [22]), and approaches that impose a backwards-time genealogical interpretation (coalescent theory; [100]). However, in these and most other approaches, a set of approximations are generally required in order to yield mathematically tractable results. Not all of these approximations may be well justified, since many were introduced long before significant data resources were available to illuminate the parameter ranges occupied by natural populations. Even with such approximations, however, it often remains difficult to include full consideration of the effects of mutation and selection, for example, in models of the multispecies coalescent [101].

The development of WFES was, in part, motivated by the recognition that many such approximations were developed before modern computers were readily accessible, which meant that tractable and transparent mathematics were absolutely essential. Today, things are very different, and we can directly analyze the fundamental models of the field from a fresh perspective. WFES is intended to facilitate such a refreshed perspective, particularly by making it easy to explore the behaviour of new models under a set of standard computational approaches that do not make any assumptions about the underlying model or its parameter values.

Direct analysis of the Markov models of population genetics is hardly unheard of, and similar approaches to the basic framework we employ have been used in previous work (e.g., [5, 16]). However, Markov model methods are often discarded in population genetics in favour of diffusion approximations, owing to the perception that they are computationally inconvenient (or worse) and perhaps due to a perceived superiority of diffusion methods in a variety of circumstances. While we recognize that diffusion theory methods can be very illuminating and are sometimes computationally advantaged, realistic models do not often yield closed-form solutions and thus require numerical integration, constituting approximations to approximations to the underlying models of interest (which are themselves approximations). As a result, sophisticated and powerful diffusion theory methods can sometimes be very sensitive to parameter choices and can thus seem to break down often.

It is also worth noting that it is difficult, though not impossible [15], to properly account for absorbing boundaries in diffusion theory methods, which also can artificially become inaccessible in a diffusion approximation. Absorbing boundaries arise naturally in models that disallow mutation, so that extinction and fixation states can never be escaped once reached. The term ‘fixation’ itself owes its origin to the consideration of such models. In reality, however, mutation happens in a bidirectional and recurrent manner, so that fixation never really means a position is ‘fixed’, except when it is maintained by unchanging selective constraints. Importantly, absorbing boundaries may also be imposed in more realistic models that include bidirectional recurrent mutation, as a contrivance to allow computation of quantities of interest, such as the time it takes for a beneficial mutation to spread throughout the population. WFES2 makes liberal use of this idea, and allows both the extinction and fixation boundaries to be treated as either absorbing or transient, depending upon the application.

One limitation of WFES2, which is shared by other software packages with overlapping aims (e.g., SpectralTDF [45]), is its focus on one-locus biallelic models. In future versions we will support multi-allelic and multi-locus models, although the space and time complexity engendered by such models requires even more creative approaches to be made feasible. The main reason why Markov model methods scale fairly poorly, to both multiple allele problems and realistic population sizes, is that the size of the transition probability matrix scales quadratically with population size (and scales even more poorly with more than one allelic type). However, the transition probability matrix of all such models is quite sparse at machine precision, and the fraction of non-zeros tends to decrease substantially for increasing population sizes. As a result, sparse matrix methods can enable sub-quadratic scaling of relevant computations [1, 2].

6.2.1 Available computations

In WFES2, numerous properties of interest are calculated automatically. These include the probabilities, expected times, and variances of times to fixation, extinction and absorption, as well as moments of the probability distribution of allele age [2]. By making the extinction state non-absorbing (transient), WFES2 computes the expected time to the next substitution, its variance, and the expected phylogenetic substitution rate. Following Kimura [88], the ‘rate’ of substitution is defined as one over the time to the next substitution (or as Kimura called it, the time between fixations). Notably, WFES2 computes a ‘finite-sites’ version of this quantity, which is thus the true rate of substitution desired in models of phylogenetic sequence evolution. By recognizing that the time to absorption in this one absorbing state model gives the time to the next substitution (e.g., starting from a monomorphic wildtype population), we recognize this quantity as having a discrete phase-type distribution [102], which allows WFES2 to apply some straightforward techniques to characterize and even fully compute the probability distribution of the time to the next substitution. As we show here, these distributions resemble geometric probability distributions but with a significant probabilistic delay. Markovian models of sequence evolution assume these quantities must be geometrically (or equivalently, exponentially) distributed, as is implied by the (strong) Markov property of memorylessness. Interestingly, this result shows that substitution times in the Wright-Fisher model are non-Markovian (and, technically, are ‘semi-Markovian’).

WFES2 optionally calculates most quantities in models with time-heterogeneous parameters, such as variable population sizes (Chapter 4). Time-heterogeneity is implemented using piecewise-constant epochs, which may be specified explicitly for some applications (with fixed durations; discussed below), or implicitly, by using a stochastic switching model governing transitions between sub-models having different parameters or population sizes. This latter approach utilizes WFES2’s rapid computation of the fundamental matrix, and can thus compute allele count probability distributions over arbitrarily long demographic histories just as rapidly as for short demographic histories. Details are provided in Chapter 4. Similarly, a two epoch model is implemented for examining the rate of substitution with standing genetic variation (Chapter 5), wherein the first epoch accumulates neutral (or slightly deleterious) variation for a specified average number of generations, which may then be acted upon by selection after the onset of (usually) positive selection. This model also uses the fundamental matrix computation, and can thus compute desired quantities using only a single row of the fundamental matrix of a Markov-modulated Wright-Fisher model. It is therefore very fast.

WFES2 calculates transient, time-dependent properties of its models as well. For example, it computes the expected time spent at each allele count (sojourn times) on the way to either fixation, extinction, or absorption. The probability distributions for the time to absorption, fixation, extinction, and substitution can also be computed rapidly under most models, as described below. Although these distributions can mostly be approximated through numerical solution of diffusion equations, WFES2’s straightforward approach has the advantage of always working, regardless of the model or its parameter choices. The full listing of available calculations and how they can be run is shown in Tables 6.1, 6.2, 6.3, 6.4.

| Mode | Model | Statistics |
|---|---|---|
| wfes single --absorption | Extinction and fixation are both absorbing states. Starting state is determined by the forward mutation rate or by user. | Probability of fixation, extinction; Expected time to fixation, extinction, absorption; Variance of time to fixation, extinction, absorption |
| wfes single --fixation | Fixation is the only absorbing state. Model can leave i = 0 state due to forward mutation. | Expected time to substitution (between fixations); Rate of substitution; Variance of time to substitution |
| wfes single --fundamental | Extinction and fixation are both absorbing. | Compute entire fundamental matrix (expected sojourn time for every starting state); Variance of the sojourn time (for every starting state) |
| wfes single --equilibrium | Neither extinction nor fixation is absorbing. | Equilibrium distribution of allele counts (mutation-selection or mutation-drift balance) |
| wfes single --allele-age | Both extinction and fixation are absorbing. | Expected age of an allele with a given observed frequency; Variance of the allele age |

Table 6.1: Calculations implemented in WFES2 with the constant population size model.

| Invocation | Model | Statistics |
|---|---|---|
| wfes switching --absorption | Model with user-defined epochs. Extinction and fixation are possible in every submodel. | Probability of fixation, extinction; Expected time to fixation, extinction; Variance of time to fixation, extinction; Probability of fixation, extinction in every submodel; Expected time spent in each submodel before fixation, extinction |
| wfes switching --fixation | Only fixation is absorbing in every submodel. | Expected time to substitution (between fixations); Rate of substitution |
| wfes sweep | Only fixation is absorbing in last submodel. | Expected time to substitution with standing variation and recurrent mutation; Rate of substitution |

Table 6.2: Calculations implemented in WFES2 with the Markov-modulated Wright-Fisher model.

| Invocation | Model | Statistics |
|---|---|---|
| wfafle | Neither fixation nor extinction is absorbing. | Exact probability distribution of allele counts following a demographic history |
| wfas | Neither fixation nor extinction is absorbing. | Approximate distribution of allele counts following a demographic history |

Table 6.3: Calculations implemented in WFES2 concerning the distribution of allele counts under a given piece-wise constant demographic history.

| Invocation | Model | Statistics |
|---|---|---|
| phase type dist | Fixation is the only absorbing state. | Distribution of times to substitution |
| phase type moments | Fixation is the only absorbing state. | Moments of the distribution of times to substitution |
| time dist | Both fixation and extinction are absorbing. | Distribution of time to fixation and extinction over time |
| time dist skip | Fixation is the only absorbing state. | Distribution of time to substitution, excluding mutation time |
| time dist sgv | Only fixation is absorbing in adaptive model. | Distribution of time to substitution, under a model with standing genetic variation |

Table 6.4: Calculations implemented in WFES2 for finding probability distributions of events over time.

6.3 Methods and Results

The overall computational approach used by most WFES2 modules is described in detail in [1] and in earlier chapters. Time-heterogeneous Markov-modulated Wright-Fisher models are introduced in Chapter 4, and standing genetic variation models are introduced in Chapter 5. Here, we describe additional details about how probability distributions are computed, about the computation of the phylogenetic substitution rate, and about the performance of a variety of optional approximations that extend the applicability of WFES2.

6.3.1 Time-dependent distributions

The probability distribution over the states of the Markov process at time $t$, $f(t)$, starting at an initial distribution $f(0)$ is:

$$f(t) = f(0) P^{t} \tag{6.3.1}$$

for the transition probability matrix $P$. Finding the matrix power in the above equation is computationally difficult since it involves taking a large matrix to a large power. The result can instead be found by iterative sparse vector-matrix multiplication of the form:

$$f(t + 1) = f(t) P \tag{6.3.2}$$
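The recurrence in (6.3.2) can be sketched as a row-vector times sparse-matrix product. The code below is an illustrative sketch only: it assumes a compressed sparse row (CSR) layout, and the type `csr_matrix` and the functions `step_distribution` and `evolve_distribution` are hypothetical names, not the routines used in WFES2.

```c
#include <assert.h>
#include <stddef.h>

/* Square sparse matrix in compressed sparse row (CSR) form. */
typedef struct {
    size_t n;               /* number of rows and columns */
    const size_t *row_ptr;  /* n + 1 offsets into col_idx and val */
    const size_t *col_idx;  /* column index of each stored entry */
    const double *val;      /* value of each stored entry */
} csr_matrix;

/* out = f P, where f is a row vector; out must not alias f. */
void step_distribution(const csr_matrix *P, const double *f, double *out)
{
    assert(P != NULL && f != NULL && out != NULL && f != out);
    for (size_t j = 0; j < P->n; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < P->n; i++) {
        if (f[i] == 0.0) continue; /* skip states that carry no mass */
        for (size_t k = P->row_ptr[i]; k < P->row_ptr[i + 1]; k++)
            out[P->col_idx[k]] += f[i] * P->val[k];
    }
}

/* Apply f(t + 1) = f(t) P for `steps` generations, updating f in place.
 * scratch is a caller-provided buffer of length n. */
void evolve_distribution(const csr_matrix *P, double *f, double *scratch,
                         size_t steps)
{
    for (size_t t = 0; t < steps; t++) {
        step_distribution(P, f, scratch);
        for (size_t j = 0; j < P->n; j++)
            f[j] = scratch[j];
    }
}
```

Each step costs time proportional to the number of stored non-zeros rather than to $n^2$, which is why the sparsity of $P$ matters for this approach.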

Since $P$ of the WF process is sparse, the calculation can benefit from efficient sparse linear algebra approaches [18]. One straightforward application of 6.3.1 is the calculation of allele count probability distributions over time. In such cases, we do not treat any of the states of $P$ as absorbing. We describe this approach in detail in section 4.3.3, and it is implemented in wfafle (Table 6.3).

It is also possible to calculate probability distributions for the time to absorption in a similar manner to equation 6.3.1. Instead of considering the allele count distribution, we focus on the probability of entering an absorbing state after $t$ generations. In this case, we substitute matrix $Q$ for $P$, where $Q$ corresponds to the sub-matrix of $P$ for only transient-to-transient state transitions. Then, for an absorbing state $k$, the time distribution at $t$ can be written as:

$$P_k(t) = \sum_i R_{i,k} \left(f(0) Q^{t-1}\right)_i = R_k \cdot \left(f(0) Q^{t-1}\right) \tag{6.3.3}$$

Above, $R_k$ represents the $k$th column of $R$, which contains the probabilities of absorption from some transient state $i$ to absorbing state $k$ within one generation, and $(\cdot)$ is the dot product. As above, this equation can be rewritten as a recurrence relation requiring only sparse matrix-vector multiplications. This requires iteratively computing the expected allele count distribution at each time step, $f(t)$ as above, but only for the transient states (using $Q$ instead of $P$ in equation 6.3.2). Then, we can write

$$P_k(t + 1) = R_k \cdot f(t) \tag{6.3.4}$$
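A minimal sketch of the recurrence behind (6.3.4) is given below, assuming a small dense $Q$ stored in row-major order. The function name `absorption_time_distribution` and its argument layout are hypothetical, and a dense product is used only to keep the example short; it stands in for the sparse multiplication described in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill out[t - 1] with P_k(t) for t = 1..T, the probability of entering
 * absorbing state k exactly at generation t.
 *   Q  : n x n transient-to-transient block, row-major
 *   Rk : length n, one-step absorption probabilities into state k
 *   f0 : length n, initial distribution over the transient states
 * Returns 0 on success, -1 on allocation failure. */
int absorption_time_distribution(const double *Q, const double *Rk,
                                 const double *f0, size_t n, size_t T,
                                 double *out)
{
    assert(Q != NULL && Rk != NULL && f0 != NULL && out != NULL && n > 0);
    double *f = malloc(n * sizeof *f);
    double *next = malloc(n * sizeof *next);
    if (f == NULL || next == NULL) {
        free(f);
        free(next);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        f[i] = f0[i];

    for (size_t t = 0; t < T; t++) {
        /* P_k(t + 1) = R_k . f(t) */
        double p = 0.0;
        for (size_t i = 0; i < n; i++)
            p += Rk[i] * f[i];
        out[t] = p;

        /* f(t + 1) = f(t) Q, restricted to transient states */
        for (size_t j = 0; j < n; j++)
            next[j] = 0.0;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                next[j] += f[i] * Q[i * n + j];

        double *tmp = f;
        f = next;
        next = tmp;
    }
    free(f);
    free(next);
    return 0;
}
```

In a toy chain where the first transient state always moves to the second, and the second is absorbed with probability 0.5 per generation, absorption at $t = 1$ has probability zero and the later values decay geometrically. This is a miniature version of a lag followed by a geometric-like tail.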

Figure 6.1 shows an example of a computed probability distribution for time to fixation in a population that starts with a single mutant. The fixation time distribution is heavy-tailed and has a large lag, and its average is approximately 4N, as expected from standard theory [103]. The calculation was performed using equation (6.3.4).

[Figure 6.1: probability of fixation (y-axis, 0 to 0.000025) versus generations (x-axis, 0 to 200,000).]

Figure 6.1: Probability distribution of time to fixation in the Wright-Fisher model, starting with 1 mutant copy. At early times t < 1000, fixation of the new allele is virtually impossible. The mean fixation time (≈ 4N) is shown with a dashed grey line. N = 10,000, u = v = 1e−9, s = 0.

6.3.2 Discrete phase-type distributions

The time to absorption probability distribution in a Markov chain with a single absorbing state is a well-studied problem with many applications, and is known as a phase-type distribution [102]. These distributions arise in WF when we are considering the distribution of time to the next substitution.

| Moment | Value | Moment | Value |
|---|---|---|---|
| 1 | 100794 | 11 | 4.18626e+62 |
| 2 | 2.02392e+10 | 12 | 5.04348e+68 |
| 3 | 6.09589e+15 | 13 | 6.58259e+74 |
| 4 | 2.44805e+21 | 14 | 9.25227e+80 |
| 5 | 1.22889e+27 | 15 | 1.39336e+87 |
| 6 | 7.40265e+32 | 16 | 2.23823e+93 |
| 7 | 5.20246e+38 | 17 | 3.82013e+99 |
| 8 | 4.17851e+44 | 18 | 6.90356e+105 |
| 9 | 3.77561e+50 | 19 | 1.31689e+112 |
| 10 | 3.79062e+56 | 20 | 2.64425e+118 |

Table 6.5: First 20 moments of the probability distribution of time to the next substitution.

The calculation can be performed with equation 6.3.4. However, with small mutation rates, the distribution has a heavy tail, and thus needs to be calculated over a large range of t. This can be computationally demanding. If instead we only require the moments of the distribution, we can use an alternative approach ([104], algorithm 1). This approach requires the LU decomposition of the (I − Q) matrix, which can be calculated efficiently. The decomposition is then used to calculate the moments of the distribution by iteratively solving a series of linear systems. This operation is very fast and requires only seconds.
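To indicate why repeated solves with $(I - Q)$ are enough to obtain moments, the block below states the standard factorial-moment identity for a discrete phase-type distribution with initial distribution $\alpha$ over the transient states. It is given as background in our notation and is an assumption about the general shape of such a calculation, not a transcription of algorithm 1 in [104]. The single-state case with $Q = (q)$ is included as a check against the geometric distribution.

```latex
\begin{aligned}
\mathbb{E}[T] &= \alpha\,(I - Q)^{-1}\mathbf{1}, \\
\mathbb{E}\big[T(T-1)\cdots(T-k+1)\big] &= k!\;\alpha\,(I - Q)^{-k}\,Q^{k-1}\,\mathbf{1}, \\[4pt]
\text{single state, } Q = (q):\qquad
\mathbb{E}[T] &= \frac{1}{1-q}, \qquad
\mathbb{E}[T(T-1)] = \frac{2q}{(1-q)^2}, \\
\operatorname{Var}[T] &= \mathbb{E}[T(T-1)] + \mathbb{E}[T] - \mathbb{E}[T]^2 = \frac{q}{(1-q)^2}.
\end{aligned}
```

Each additional factor of $(I - Q)^{-1}$ corresponds to one more linear solve that can reuse the same LU factors, which is consistent with the moments being obtained from a series of linear systems.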

Figure 6.2 shows the phase-type distribution calculated in WFES2. Note that here we used a smaller population size, and the x-axis is on a log scale. The dashed vertical line indicates the expected time to the next substitution, which in this mutation-limited range ($\theta = 4 \times 10^{-3}$) is approximately equal to the inverse of the mutation rate. Notice again the delay before the substitution has non-negligible probability. We used the iterative method [104] for calculating the moments of the fixation time distribution for the same model conditions as in Figure 6.2. The numerical values of the first 20 moments are shown in Table 6.5.

[Figure 6.2: probability of substitution (y-axis, 0 to 0.000010) versus generations (x-axis, log scale, $10^0$ to $10^6$).]

Figure 6.2: Probability distribution of time to the next substitution in the Wright-Fisher model (s = 0, neutral case). This is computed as the time to absorption when only fixation is an absorbing state, given that we started in a monomorphic wildtype population. Note the large lag: substitutions are impossible for the first 100 generations in this case. The mean substitution time ($E[t_{sub}] \approx 1/\mu = 10^5$) is shown with a dashed grey line. N = 100, u = v = 1e−5, s = 0.

6.4 Small population size approximation and truncation

WFES2 implements several optional approximations intended to increase its applicability and performance. First, as it is sometimes valuable to control the sparsity of the transition probability matrix, WFES2 implements a novel method for truncating rows of the transition probability matrix by limiting computation to the central mass of the binomial distribution 2.3.14 (see supplement).

Second, while the techniques used in WFES2 are very efficient, there still may be cases where populations of interest may be too large for practical consideration. In these cases, a potential solution is to try to model a large population by a smaller population with parameters rescaled to conserve the behaviour of the unapproximated model as well as possible (e.g., [64]). In general this requires rescaling parameters in the model to conserve their population scaled values. When computing times, this technique also requires an appropriate rescaling of time (usually by scaling time by the ratio of the approximation to true population sizes). When allele count probability distributions are of interest, these may naturally be rescaled as frequencies. An alternative approach that seems to work well is to use a single generation jump from one population size to another (see Chapter 4). While these techniques are widely used, it may not be widely appreciated that they constitute an imperfect approximation, which can perform differently in different parameter regimes (below).

Figure 6.3 shows the probability distributions (log scale) for times to fixation and extinction using different approximating population sizes, compared to the true population size (N = 10,000). For each approximating distribution, time is rescaled using the ratio of the approximating to true population sizes so that time has the same meaning in each model (this is why the extinction time distributions appear to start at different times along the x-axis, even though they were all computed starting from generation 1). The probability distribution of time to fixation appears to be well approximated with small approximating population sizes. In comparison, the distribution of time to extinction varies more when smaller population sizes are used. This can be explained by the starting conditions of the model. Indeed, it is a general limitation of approximating a larger population with a smaller one (in discrete state spaces) that the initial conditions can never be matched exactly. That is, starting in 1 copy in a large population corresponds to starting in fewer than one copy in a smaller population size, which cannot be represented in a discrete model with a minimum of one or zero starting copies. This problem is more consequential for the time to extinction compared to fixation, because the starting number of copies has a much greater effect on how quickly extinction will occur (where initial conditions are already very close to the absorbing boundary) than on the rate of fixation (where initial conditions are as far as possible from the relevant absorbing boundary).
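The parameter and time rescaling described above can be sketched as follows. This is an illustrative sketch under the assumption that the population-scaled values 2Ns and 4Nµ are the quantities being conserved; the struct `wf_params` and the functions `rescale_params` and `to_true_generations` are hypothetical names and not part of the WFES2 interface.

```c
#include <assert.h>

/* Hypothetical parameter bundle for a one-locus biallelic model. */
typedef struct {
    long N;    /* population size */
    double s;  /* selection coefficient */
    double u;  /* forward mutation rate */
    double v;  /* backward mutation rate */
} wf_params;

/* Build an approximating model of size N_approx that conserves the
 * population-scaled values 2Ns, 4Nu and 4Nv of the true model.
 * The caller should check that the rescaled rates remain valid
 * (for example, mutation rates must stay below 1). */
wf_params rescale_params(wf_params true_model, long N_approx)
{
    assert(true_model.N > 0 && N_approx > 0);
    double ratio = (double)true_model.N / (double)N_approx;
    wf_params approx = {
        N_approx,
        true_model.s * ratio,
        true_model.u * ratio,
        true_model.v * ratio
    };
    return approx;
}

/* Express a time measured in generations of the approximating model
 * in generations of the true model. */
double to_true_generations(double t_approx, long N_true, long N_approx)
{
    assert(N_true > 0 && N_approx > 0);
    return t_approx * (double)N_true / (double)N_approx;
}
```

Under this convention, generation 1 of an approximating model with N = 100 corresponds to generation 100 of a true model with N = 10,000, which matches the offset starting points of the extinction time distributions described above.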

[Figure 6.3: panels A (probability of fixation) and B (probability of extinction), both on log scales, versus generations (log scale), for approximating population sizes 100, 500, 1000, 5000, and 10000 (exact).]

Figure 6.3: Probability distribution of time to fixation (A) and extinction (B) for a neutral variant, with different approximating population sizes and θ = 4Nµ. True population size $N_{true} = 10{,}000$. The distribution of time to fixation is well approximated. In comparison, the time to extinction distribution diverges with a smaller population size. $\theta = 4 \times 10^{-5}$ for all graphs, 2Ns = 0, starting in a single derived allele copy.

The effect of selection on the performance of the small population size approximation can also be examined. With increasingly strong positive selection, the performance of the approximation worsens (Figures 6.4, 6.5). While with 2Ns = 10 the approximation works about as well as in the neutral case, with 2Ns = 100 the results in approximating population sizes diverge more substantially from the true distribution (Figure 6.5). It should therefore be kept in mind that while small population sizes can approximate the behaviour of larger population size models well for certain statistics, the finiteness of the initial distribution can be problematic.

6.5 Conclusions

WFES2 provides an efficient computational framework for the comprehensive characterization of many short and long term behaviours of the Wright-Fisher model (including for variations of the model including time-heterogeneous evolutionary forces and/or population sizes). Computations of many quantities of interest are implemented, and variances and higher moments are provided when possible. Rapid computation of full probability distributions is also now supported.

WFES2 is intended to serve as a modelling workbench, by facilitating the rapid exploration of model properties without restrictive simplifying assumptions or the need for high replicate simulations. As all computations made by WFES2 are algebraically exact, the program is very general and can be easily applied to models with different forces and parameters included. Even in large population sizes, all supported computations generally take only seconds, but may sometimes take as long as a few minutes on modern computer architectures. It is our hope that this platform may help stimulate a renewed interest in Markov models in population genetics, whose utility and power have arguably been underappreciated.

6.6 Supplement

6.6.1 Adjustable sparsity threshold

The rows of the Wright-Fisher model are binomial probability distributions. While the binomial distribution has support between 0 and sample size M, in the Wright-Fisher model specifically, a large change in allele count within a single generation is generally unlikely. To increase the sparsity of the matrix, the first version of WFES optionally ignored all entries with probability below some ε (10⁻²⁵ by default). However, this still required computing each entry of the matrix. The second version of WFES implements an improved procedure, which instead retains only 1 − α of the probability mass of the distribution for a given row. The calculation uses the cumulative binomial distribution. By default, α = 10⁻²⁰. Increasing the value of the parameter results in faster matrix build times and computation, at a sacrifice of precision. In our tests, values below α = 10⁻¹⁵ were indistinguishable from α = 0 (full distribution without truncation). With α = 10⁻⁵, the relative error of summary statistics did not exceed 0.03% with population size N = 5 × 10⁵.
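The row truncation described above can be illustrated with a minimal sketch. This is not the WFES2 implementation: the function names `binom_pmf` and `truncated_row` are hypothetical, and growing a window outward from the mode until it holds 1 − α of the mass is only one assumed way of choosing which entries to keep. The point it shows is that entries far from the mode are never evaluated.

```python
import math

def binom_pmf(j, m, p):
    """Binomial(m, p) probability of j successes, evaluated in log space."""
    if p <= 0.0:
        return 1.0 if j == 0 else 0.0
    if p >= 1.0:
        return 1.0 if j == m else 0.0
    log_pmf = (math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
               + j * math.log(p) + (m - j) * math.log1p(-p))
    return math.exp(log_pmf)

def truncated_row(m, p, alpha):
    """Return (start, probs): a contiguous window of one matrix row that
    holds at least 1 - alpha of the mass (hypothetical helper).

    The window grows from the mode toward whichever neighbour is more
    probable, so only entries inside (or adjacent to) the window are computed.
    """
    mode = min(m, int((m + 1) * p))
    lo = hi = mode
    window = [binom_pmf(mode, m, p)]
    mass = window[0]
    left = binom_pmf(lo - 1, m, p) if lo > 0 else -1.0
    right = binom_pmf(hi + 1, m, p) if hi < m else -1.0
    while mass < 1.0 - alpha and (lo > 0 or hi < m):
        if left >= right:
            lo -= 1
            window.insert(0, left)
            mass += left
            left = binom_pmf(lo - 1, m, p) if lo > 0 else -1.0
        else:
            hi += 1
            window.append(right)
            mass += right
            right = binom_pmf(hi + 1, m, p) if hi < m else -1.0
    return lo, window

start, probs = truncated_row(1000, 0.3, 1e-6)
print(start, len(probs), sum(probs))
```

Because this sketch tracks the retained mass in double precision, it cannot distinguish values of α much below 10⁻¹⁶ from α = 0; a production implementation would need to work with tail probabilities instead.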

6.6.2 Distributions of time to absorption with selection

[Figure 6.4 plot. Panel A: probability of fixation against generations. Panel B: probability of extinction against generations. Both panels use logarithmic axes, with one curve per approximating population size (100, 500, 1000, 5000, 10000).]

Figure 6.4: Probability distribution of time to fixation (A) and extinction (B), with different approximating population sizes, preserving θ = 4Nµ and 2Ns = 10. True population size N_true = 10,000. The distribution of time to fixation is well approximated. In comparison, the time to extinction distribution diverges with a smaller population size. θ = 4 × 10⁻⁵ for all graphs, 2Ns = 10, starting from a single derived allele copy.

[Figure 6.5 plot. Panel A: probability of fixation against generations. Panel B: probability of extinction against generations. Both panels use logarithmic axes, with one curve per approximating population size (100, 500, 1000, 5000, 10000).]

Figure 6.5: Probability distribution of time to fixation (A) and extinction (B), with different approximating population sizes, preserving θ = 4Nµ and 2Ns = 100. True population size N_true = 10,000. The distribution of time to fixation is well approximated. In comparison, the time to extinction distribution diverges with a smaller population size. θ = 4 × 10⁻⁵ for all graphs, 2Ns = 100, starting from a single derived allele copy.

Chapter 7

Future Directions

In this work, I considered a variety of problems relevant to understanding the effects of mutation and selection on molecular evolution. Although these methods provided new insights and are generally applicable to several important problems, this work still has its limitations. In particular, like others, we have focused on biallelic models for tractability. When focusing on intraspecific sequence comparisons, this is generally a reasonable assumption for most sites (e.g., in current human population genomic databases, there are relatively few positions with more than two alleles observed in the population). However, for interspecific sequence comparisons, we want multiallelic models that can account for both recurrent mutation and substitution at individual (finite) sites. As a result, a major direction of interest is to extend this work to consider multi-allelic evolution in such a way that mutation to different allelic types is allowed in each generation. This would be significantly more complicated than in current phylogenetic substitution models (see Teufel et al. [9] for review), which to our knowledge universally assume that mutations arise and reach their fates one by one, as justified by a strongly mutation-limited view of sequence evolution. As pointed out in this dissertation, this view may be at odds with the observation that most known cases of selective sweeps are soft. Ideally, such an approach would allow for competition among allelic types segregating at the same position. One way this might be accomplished is to build a full multi-allelic Wright-Fisher model.

7.1 Multiple Alleles

The number of states in a k-allele Wright-Fisher model, where the sum of all the allele counts equals the population size N, is described by the composition function:

$$|WF(N, k)| = \binom{N + k - 1}{k - 1} = \frac{\prod_{i=1}^{k-1} (i + N)}{(k - 1)!} \tag{7.1.1}$$
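The state count in equation 7.1.1 can be checked numerically. The following is a small sketch; `num_states` and `enumerate_states` are hypothetical names, and N is taken here to be the total number of gene copies shared among the k allelic types. For k = 2 the count is N + 1, corresponding to allele counts 0 through N.

```python
import math

def num_states(n, k):
    """Number of ways to split n gene copies among k allelic types."""
    return math.comb(n + k - 1, k - 1)

def enumerate_states(n, k):
    """Yield every tuple of k non-negative allele counts that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in enumerate_states(n - first, k - 1):
            yield (first,) + rest

# The closed form agrees with brute-force enumeration for a small case.
assert len(list(enumerate_states(10, 3))) == num_states(10, 3)

# State counts for N = 100 with two, three, and four alleles.
for k in (2, 3, 4):
    print(k, num_states(100, k))
# prints:
# 2 101
# 3 5151
# 4 176851
```

Since a dense transition matrix needs the square of the state count in entries, even N = 100 with four alleles would already require on the order of 3 × 10¹⁰ entries.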

For k = 2, the formula above yields N + 1 states, so that the transition matrix has (N + 1)² entries. With a larger number of alleles, the number of valid states grows rapidly, approximately as N^(k−1)/(k − 1)! when N is much larger than k. This implies that multi-allelic discrete Wright-Fisher models might not be feasible in practice. While it should be possible to simplify the calculations somewhat by exploiting the sparsity of the transition matrices, computations with large population sizes remain practically out of reach. A promising alternative is to consider the Moran model [105, 5], which we briefly described in chapter 2. The Moran model has a much sparser transition probability matrix, and it can feasibly be extended to multiple alleles. The Moran model yields probabilities of absorption compatible with the Wright-Fisher model, and the times are comparable up to a constant factor (due to the different definition of generation time in each model). Despite these advantages, the Moran model is not easily amenable to the Markov-modulated approach we use in chapters 4 and 5. Since the Moran model only allows transitions to neighboring allele counts, large population size changes introduce discontinuities into the transition probability matrix (see 4.3.8). Due to this limitation, we did not pursue it here.

In related future work, we are interested in using the more realistic models developed here to power hybrid mutation-selection type models of codon substitution and segregation, which could be used to infer selection coefficients jointly from human population genomic data, and from a large collection of reference genomes in non-human mammals. While the across-species comparisons are highly informative about the variants that ‘work’, within-species comparisons have much greater potential to illuminate the relative effects of deleterious variants [106]. Combining both types of data should yield informative estimates of the deleteriousness of specific mutations across the genome, which would be highly prized by clinical geneticists (especially among those studying rare genetic disease).

Ideally, we would like to be able to use Wright-Fisher type models to directly compute the likelihood of within- and between-species sequence alignments. Some work in this vein has been pursued by other groups. For example, Tataru et al. [107] and Hobolth and Siren [108] have developed approximate diffusion-based approaches for computing likelihoods of phylogenetic sequence comparisons. In addition, De Maio et al. [109] have developed so-called ‘polymorphism aware’ models of phylogenetic substitution that include intermediate states that represent unfixed variants segregating in different frequency classes. Both of these approaches have some benefits, and certainly others are possible as well.

7.2 Comparison to diffusion theory

The main results of diffusion theory have been widely used in population genetics and evolutionary biology. The core assumption of the diffusion approach is that allele frequencies do not change to a large degree within a single generation. This assumption holds well with large population size, weak selection, and weak mutation, known together as the diffusion limit [110]. The diffusion limit is not guaranteed to hold, and is violated when mutation rates are high, which happens in many cases in nature (discussed in chapter 3). The direct computational approach has the advantage of not making any assumption about the per-generation rate of change, and is therefore robust to this issue.

An important result of our work, described in chapter 5, concerns the approach of the allele frequency to equilibrium. Equilibrium is a common and mathematically convenient assumption (we employ it in section 4.4.3). However, as the section shows, if the population starts in a monomorphic state, the equilibrium for neutral or advantageous variants will not be reached within millions of generations. In other words, the mutation-selection balance should only be applicable to deleterious variants. Mutation-drift balance is therefore a very strong assumption, and should be treated accordingly.
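The contrast drawn at the start of this section, between the diffusion approximation and direct computation, can be made concrete with a toy comparison. The sketch below is illustrative only and is not WFES code. It assumes a haploid Wright-Fisher model with M gene copies, selection coefficient s, and no mutation, and it uses selection rather than mutation strength to show the two approaches diverging. All function names are hypothetical. The exact fixation probability comes from solving the absorbing-chain linear system (I − Q)u = r, and the diffusion value from the standard haploid formula (1 − e^(−2s)) / (1 − e^(−2Ms)) for a single initial copy.

```python
import math

def wf_matrix(m, s):
    """Haploid Wright-Fisher transition matrix (assumed model: m gene
    copies, selection coefficient s, no mutation). Rows are binomial."""
    rows = []
    for i in range(m + 1):
        p = (1.0 + s) * i / ((1.0 + s) * i + (m - i))
        rows.append([math.comb(m, j) * p**j * (1.0 - p)**(m - j)
                     for j in range(m + 1)])
    return rows

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def exact_fixation(m, s):
    """Fixation probability from each transient state 1..m-1,
    obtained from (I - Q) u = r, where r holds one-step fixation odds."""
    p = wf_matrix(m, s)
    a = [[(1.0 if i == j else 0.0) - p[i][j] for j in range(1, m)]
         for i in range(1, m)]
    r = [p[i][m] for i in range(1, m)]
    return solve(a, r)

def diffusion_fixation(m, s):
    """Diffusion approximation for one initial copy (haploid form)."""
    if s == 0.0:
        return 1.0 / m
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-2.0 * s * m))

for s in (0.0, 0.01, 0.5):
    print(s, exact_fixation(50, s)[0], diffusion_fixation(50, s))
```

With s = 0 the exact solution recovers the neutral value 1/M, and for weak selection the two values should agree closely. The gap should widen as s grows, since the per-generation change in allele frequency is then no longer small.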

7.3 Coalescent

We did not consider coalescent approaches in this work. However, a large fraction of population-genetic analyses is based on coalescent theory results or simulations. Ancestral recombination graphs [111, 112] are an exciting approach to model and make inferences about multiple loci subject to recombination. The general principle of constructing full transition probability matrices can be applied to some coalescent problems, which I intend to address in my future work. It is particularly exciting that recent simulation work is focusing on combining coalescent and Wright-Fisher models for investigating large datasets [113]. It is possible to consider the discrete coalescent process and model it explicitly, using the methodology described here. The traditional Kingman [100] coalescent is based on the assumption that only a single pair of lineages coalesces in a single generation. If we construct the transition probability matrix, it should be possible to include different types of events, such as coalescence of multiple pairs within the same generation. This relaxes the approximate nature of Kingman’s approach, and makes the framework applicable to increasingly large sample sizes.
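As a rough illustration of what such an explicit transition matrix could look like, the sketch below builds the one-generation matrix of a discrete coalescent. It rests on assumptions not made in the text: a haploid population of M gene copies in which each of n sampled lineages picks its parent uniformly at random. Under that assumption the probability of exactly k distinct parents is S(n, k) · M(M − 1)⋯(M − k + 1) / Mⁿ, where S(n, k) is a Stirling number of the second kind. The function names are hypothetical.

```python
from fractions import Fraction
import math

def stirling2(n_max):
    """Table s[n][k] of Stirling numbers of the second kind."""
    s = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    s[0][0] = 1
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            s[n][k] = k * s[n - 1][k] + s[n - 1][k - 1]
    return s

def coalescent_matrix(n_max, m):
    """p[n][k]: probability that n lineages have exactly k distinct parents
    one generation back, with parents drawn uniformly from m gene copies."""
    s = stirling2(n_max)
    p = [[Fraction(0)] * (n_max + 1) for _ in range(n_max + 1)]
    p[0][0] = Fraction(1)  # zero lineages stay at zero
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            # math.perm(m, k) is the falling factorial m! / (m - k)!
            p[n][k] = Fraction(s[n][k] * math.perm(m, k), m ** n)
    return p

def beyond_single_pair(p, n):
    """Probability of any event other than 'no coalescence' or
    'exactly one pair coalesces' among n lineages in one generation."""
    return 1 - p[n][n] - p[n][n - 1]

m = 1000
p = coalescent_matrix(50, m)
for n in (5, 20, 50):
    print(n, float(beyond_single_pair(p, n)))
```

The entry `p[n][n - 1]` covers exactly the single-pair events that the Kingman coalescent retains, so `beyond_single_pair` measures the mass the approximation leaves out. That mass grows with the sample size n relative to M.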

Bibliography

[1] Ivan Krukov, Bianca de Sanctis, and A. P. Jason de Koning. Wright–Fisher exact solver (WFES): Scalable analysis of population genetic models without simulation or diffusion theory. Bioinformatics, 33(9):1416–1417, May 2017.

[2] Bianca de Sanctis, Ivan Krukov, and A. P. Jason de Koning. Allele Age Under Non-Classical Assumptions is Clarified by an Exact Computational Markov Chain Approach. Scientific Reports, 7(1):11869, September 2017.

[3] Ronald Aylmer Fisher. On the Dominance Ratio. Proceedings of the Royal Society of Edinburgh, 42:321–341, 1922.

[4] Sewall Wright. Evolution in Mendelian Populations. Genetics, 16(2):97–159, January 1931.

[5] W. J. Ewens. Mathematical Population Genetics: I. Theoretical Introduction, volume 27 of Interdisciplinary Applied Mathematics. Springer New York, New York, 2nd edition, 2004. OCLC: 958522782.

[6] Sean H. Rice. Evolutionary Theory: Mathematical and Conceptual Foundations. Sinauer Associates, Sunderland, MA, USA, 2004.

[7] Motoo Kimura. Diffusion Models in Population Genetics. Journal of Applied Probability, 1(2):177–232, December 1964.

[8] John Kemeny and Laurie Snell. Finite Markov Chains. Undergraduate Texts in Mathematics. Springer, 1960.

[9] Ashley I. Teufel, Andrew M. Ritchie, Claus O. Wilke, and David A. Liberles. Using the Mutation-Selection Framework to Characterize Selection on Protein Sequences. Genes, 9(8), August 2018.

[10] Brian Golding and Joe Felsenstein. A maximum likelihood approach to the detection of selection from a phylogeny. Journal of Molecular Evolution, 31(6):511–523, December 1990.

[11] A. L. Halpern and W. J. Bruno. Evolutionary distances for protein-coding sequences: Modeling site-specific residue frequencies. Molecular Biology and Evolution, 15(7):910–917, July 1998.

[12] David M. McCandlish and Arlin Stoltzfus. Modeling Evolution Using the Probability of Fixation: History and Implications. The Quarterly Review of Biology, 89(3):225–252, September 2014.

[13] Nicolas Rodrigue, Hervé Philippe, and Nicolas Lartillot. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proceedings of the National Academy of Sciences, 107(10):4629–4634, March 2010.

[14] Motoo Kimura. Diffusion Models in Population Genetics. Journal of Applied Probability, 1(2):177–232, December 1964.

[15] Lei Zhao, Xingye Yue, and David Waxman. Complete Numerical Solution of the Diffusion Equation of Random Genetic Drift. Genetics, 194(4):973–985, August 2013.

[16] Peter D. Keightley and Adam Eyre-Walker. Joint Inference of the Distribution of Fitness Effects of Deleterious Mutations and Population Demography Based on Nucleotide Polymorphism Frequencies. Genetics, 177(4):2251–2261, December 2007.

[17] A.P. Jason de Koning and Bianca D. de Sanctis. The Rate of Molecular Evolution When Mutation May Not Be Weak. Nature Communications, August 2018.

[18] Intel Software Corporation. INTEL MKL PARDISO - Parallel Direct Sparse Solver Interface, 2019.

[19] O. Schenk, K. Gärtner, and W. Fichtner. Efficient Sparse LU Factorization with Left-Right Looking Strategy on Shared Memory Multiprocessors. BIT Numerical Mathematics, 40(1):158–176, March 2000.

[20] Olaf Schenk, Klaus Gärtner, Wolfgang Fichtner, and Andreas Stricker. PARDISO: A high-performance serial and parallel sparse linear solver in semiconductor device simulation. Future Generation Computer Systems, 18(1):69–78, September 2001.

[21] Danil Nemirovsky. Tensor approach to mixed high-order moments of absorbing Markov chains. Report, INRIA, 2009.

[22] Motoo Kimura. On the Probability of Fixation of Mutant Genes in a Population. Genetics, 47(6):713–719, June 1962.

[23] Wenqing Fu, Timothy D. O’Connor, Goo Jun, Hyun Min Kang, Goncalo Abecasis, Suzanne M. Leal, Stacey Gabriel, David Altshuler, Jay Shendure, Deborah A. Nickerson, Michael J. Bamshad, Broad GO, Seattle GO, and Joshua M. Akey. Analysis of 6,515 exomes reveals a recent origin of most human protein-coding variants. Nature, 493(7431):216–220, January 2013.

[24] Takeo Maruyama. The age of an allele in a finite population. Genetics Research, 23(2):137–143, April 1974.

[25] T Maruyama. The age of a rare mutant gene in a large population. American Journal of Human Genetics, 26(6):669–673, November 1974.

[26] W H Li. The first arrival time and mean age of a deleterious mutant gene in a finite population. American Journal of Human Genetics, 27(3):274–286, May 1975.

[27] G. A. Watterson. Reversibility and the age of an allele II. Two-allele models, with selection and mutation. Theoretical Population Biology, 12(2):179–196, October 1977.

[28] R. C. Griffiths and Simon Tavaré. The age of a mutation in a general coalescent tree. Communications in Statistics. Stochastic Models, 14(1-2):273–295, January 1998.

[29] Montgomery Slatkin and Bruce Rannala. Estimating Allele Age. Annual Review of Genomics and Human Genetics, 1(1):225–249, September 2000.

[30] Adam Kiezun, Sara L. Pulit, Laurent C. Francioli, Freerk van Dijk, Morris Swertz, Dorret I. Boomsma, Cornelia M. van Duijn, P. Eline Slagboom, G. J. B. van Ommen, Cisca Wijmenga, Genome of the Netherlands Consortium, Paul I. W. de Bakker, and Shamil R. Sunyaev. Deleterious Alleles in the Human Genome Are on Average Younger Than Neutral Alleles of the Same Frequency. PLOS Genetics, 9(2):e1003301, February 2013.

[31] Motoo Kimura and Tomoko Ohta. The Average Number of Generations until Fixation of a Mutant Gene in a Finite Population. Genetics, 61(3):763–771, March 1969.

[32] Philipp M. Altrock, Chaitanya S. Gokhale, and Arne Traulsen. Stochastic slowdown in evolutionary processes. Physical Review E, 82(1):011925, July 2010.

[33] Philipp M. Altrock, Arne Traulsen, and Tobias Galla. The mechanics of stochastic slowdown in evolutionary games. Journal of Theoretical Biology, 311:94–106, October 2012.

[34] Fabrizio Mafessoni and Michael Lachmann. Selective Strolls: Fixation and Extinction in Diploids Are Slower for Weakly Selected Mutations Than for Neutral Ones. Genetics, 201(4):1581–1589, December 2015.

[35] Asher D. Cutter, Richard Jovelin, and Alivia Dey. Molecular hyperdiversity and evolution in very large populations. Molecular Ecology, 22(8):2074–2095, 2013.

[36] Frank Maldarelli, Mary Kearney, Sarah Palmer, Robert Stephens, JoAnn Mican, Michael A. Polis, Richard T. Davey, Joseph Kovacs, Wei Shao, Diane Rock-Kress, Julia A. Metcalf, Catherine Rehm, Sarah E. Greer, Daniel L. Lucey, Kristen Danley, Harvey Alter, John W. Mellors, and John M. Coffin. HIV Populations Are Large and Accumulate High Genetic Diversity in a Nonlinear Fashion. Journal of Virology, 87(18):10313–10323, September 2013.

[37] Timothy J. C. Anderson, Shalini Nair, Marina McDew-White, Ian H. Cheeseman, Standwell Nkhoma, Fatma Bilgic, Rose McGready, Elizabeth Ashley, Aung Pyae Phyo, Nicholas J. White, and François Nosten. Population Parameters Underlying an Ongoing Soft Sweep in Southeast Asian Malaria Parasites. Molecular Biology and Evolution, 34(1):131–144, January 2017.

[38] Philipp W. Messer and Dmitri A. Petrov. Population genomics of rapid adaptation by soft selective sweeps. Trends in Ecology & Evolution, 28(11):659–669, November 2013.

[39] Jeffrey D. Jensen. On the unfounded enthusiasm for soft selective sweeps. Nature Communications, 5:5281, October 2014.

[40] Talia Karasov, Philipp W. Messer, and Dmitri A. Petrov. Evidence that Adaptation in Drosophila Is Not Limited by Mutation at Single Sites. PLOS Genetics, 6(6):e1000924, June 2010.

[41] Pleuni S. Pennings and Joachim Hermisson. Soft Sweeps II—Molecular Population Genetics of Adaptation from Recurrent Mutation or Migration. Molecular Biology and Evolution, 23(5):1076–1084, May 2006.

[42] Motoo Kimura and Tomoko Ohta. The Age of a Neutral Mutant Persisting in a Finite Population. Genetics, 75(1):199–212, September 1973.

[43] Kyung C. Chae and Tae S. Kim. Reversed absorbing Markov chain: A simple path approach. Operations Research Letters, 16(1):41–46, August 1994.

[44] Shuhao Qiu and Alexei Fedorov. Maruyama’s allelic age revised by whole-genome GEMA simulations. Genomics, 105(5):282–287, May 2015.

[45] Matthias Steinrücken, Ethan M. Jewett, and Yun S. Song. SpectralTDF: Transition densities of diffusion processes with time-varying selection parameters, mutation rates and effective population sizes. Bioinformatics, 32(5):795–797, March 2016.

[46] Yun S. Song and Matthias Steinrücken. A Simple Method for Finding Explicit Analytic Transition Densities of Diffusion Processes with General Diploid Selection. Genetics, 190(3):1117–1129, March 2012.

[47] D. Waxman. A Unified Treatment of the Probability of Fixation when Population Size and the Strength of Selection Change Over Time. Genetics, 188(4):907–913, August 2011.

[48] Steven N. Evans, Yelena Shvets, and Montgomery Slatkin. Non-equilibrium theory of the allele frequency spectrum. Theoretical Population Biology, 71(1):109–119, February 2007.

[49] P. R. Amestoy, I. S. Duff, and J. Y. L’Excellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Computer Methods in Applied Mechanics and Engineering, 184(2):501–520, April 2000.

[50] Sewall Wright. Evolution in Mendelian Populations. Genetics, 16(2):97–159, March 1931.

[51] R. A. Fisher. The Genetical Theory of Natural Selection: A Complete Variorum Edition. OUP Oxford, 1930.

[52] William Feller. Diffusion Processes in Genetics. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. The Regents of the University of California, 1951.

[53] Sarah Otto and Michael Whitlock. The Probability of Fixation in Populations of Changing Size. Genetics, 146:723–733, March 1997.

[54] Motoo Kimura. Process Leading to Quasi-Fixation of Genes in Natural Populations Due to Random Fluctuation of Selection Intensities. Genetics, 39(3):280–295, May 1954.

[55] Steven N. Evans, Yelena Shvets, and Montgomery Slatkin. Non-equilibrium theory of the allele frequency spectrum. Theoretical Population Biology, 71(1):109–119, February 2007.

[56] Daniel Živković, Matthias Steinrücken, Yun S. Song, and Wolfgang Stephan. Transition Densities and Sample Frequency Spectra of Diffusion Processes with Selection and Variable Population Size. Genetics, 200(2):601–617, June 2015.

[57] Samuel Karlin. Rates of Approach to Homozygosity for Finite Stochastic Models with Variable Population Size. The American Naturalist, 102(927):443–455, September 1968.

[58] Sewall Wright. Size of population and breeding structure, in relation to evolution. Science, 87(2263):430–432, 1938.

[59] John H. Gillespie. Population Genetics, a Concise Guide. Johns Hopkins University Press, Baltimore and London, 2nd edition, 2004.

[60] Takeo Maruyama and Motoo Kimura. A Note on the Speed of Gene Frequency Changes in Reverse Directions in a Finite Population. Evolution, 28(1):161–163, 1974.

[61] Ryan N. Gutenkunst, Ryan D. Hernandez, Scott H. Williamson, and Carlos D. Bustamante. Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLOS Genetics, 5(10):e1000695, October 2009.

[62] Bernard Y. Kim, Christian D. Huber, and Kirk E. Lohmueller. Inference of the Distribution of Selection Coefficients for New Nonsynonymous Mutations Using Large Samples. Genetics, 206(1):345–361, May 2017.

[63] Chris Tuffley and Mike Steel. Modeling the covarion hypothesis of nucleotide substitution. Mathematical Biosciences, 147(1):63–91, January 1998.

[64] Kai Zeng and Brian Charlesworth. Estimating Selection Intensity on Synonymous Codon Usage in a Nonequilibrium Population. Genetics, 183(2):651–662, October 2009.

[65] John Wakeley. The Limits of Theoretical Population Genetics. Genetics, 169(1):1–7, January 2005.

[66] Ziheng Yang. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution, 24(8):1586–1591, August 2007.

[67] Reed A. Cartwright, Nicolas Lartillot, and Jeffrey L. Thorne. History Can Matter: Non-Markovian Behavior of Ancestral Lineages. Systematic Biology, 60(3):276–290, May 2011.

[68] Joachim Hermisson and Pleuni S. Pennings. Soft Sweeps: Molecular Population Genetics of Adaptation From Standing Genetic Variation. Genetics, 169(4):2335–2352, April 2005.

[69] Nandita R. Garud and Dmitri A. Petrov. Elevated Linkage Disequilibrium and Signatures of Soft Sweeps Are Common in Drosophila melanogaster. Genetics, 203(2):863–880, June 2016.

[70] Sara Sheehan and Yun S. Song. Deep Learning for Population Genetic Inference. PLOS Computational Biology, 12(3):e1004845, March 2016.

[71] Todd A. Schlenke and David J. Begun. Linkage Disequilibrium and Recent Selection at Three Immunity Receptor Loci in Drosophila simulans. Genetics, 169(4):2013–2022, April 2005.

[72] Molly Przeworski, Graham Coop, and Jeffrey D. Wall. The Signature of Positive Selection on Standing Genetic Variation. Evolution, 59(11):2312–2323, 2005.

[73] Benjamin M. Peter, Emilia Huerta-Sanchez, and Rasmus Nielsen. Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation. PLOS Genetics, 8(10):e1003011, October 2012.

[74] Daniel R. Schrider and Andrew D. Kern. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome. Molecular Biology and Evolution, 34(8):1863–1877, August 2017.

[75] John Novembre and Eunjung Han. Human population structure and the adaptive response to pathogen-induced selection pressures. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1590):878–886, March 2012.

[76] Roger Paredes, Christina M. Lalama, Heather J. Ribaudo, Bruce R. Schackman, Cecilia Shikuma, Francoise Giguel, William A. Meyer, Victoria A. Johnson, Susan A. Fiscus, Richard T. D’Aquila, Roy M. Gulick, and Daniel R. Kuritzkes. Pre-existing Minority Drug-Resistant HIV-1 Variants, Adherence, and Risk of Antiretroviral Treatment Failure. The Journal of Infectious Diseases, 201(5):662–671, March 2010.

[77] Benjamin A. Wilson, Nandita R. Garud, Alison F. Feder, Zoe J. Assaf, and Pleuni S. Pennings. The population genetics of drug resistance evolution in natural populations of viral, bacterial and eukaryotic pathogens. Molecular Ecology, 25(1):42–66, 2016.

[78] Cassandra B. Jabara, Corbin D. Jones, Jeffrey Roach, Jeffrey A. Anderson, and Ronald Swanstrom. Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID. Proceedings of the National Academy of Sciences, 108(50):20166–20171, December 2011.

[79] Pleuni S. Pennings, Sergey Kryazhimskiy, and John Wakeley. Loss and Recovery of Genetic Diversity in Adapting Populations of HIV. PLOS Genetics, 10(1):e1004000, January 2014.

[80] Shalini Nair, Denae Nash, Daniel Sudimack, Anchalee Jaidee, Marion Barends, Anne-Catrin Uhlemann, Sanjeev Krishna, François Nosten, and Tim J. C. Anderson. Recurrent Gene Amplification and Soft Selective Sweeps during Evolution of Multidrug Resistance in Malaria Parasites. Molecular Biology and Evolution, 24(2):562–573, February 2007.

[81] H. Allen Orr and Andrea J. Betancourt. Haldane’s Sieve and Adaptation From the Standing Genetic Variation. Genetics, 157(2):875–884, February 2001.

[82] Pleuni S. Pennings and Joachim Hermisson. Soft Sweeps III: The Signature of Positive Selection from Recurrent Mutation. PLOS Genetics, 2(12):e186, December 2006.

[83] Daniel R. Schrider, Fábio K. Mendes, Matthew W. Hahn, and Andrew D. Kern. Soft Shoulders Ahead: Spurious Signatures of Soft and Partial Selective Sweeps Result from Linked Hard Sweeps. Genetics, 200(1):267–284, May 2015.

[84] Daniel R. Schrider and Andrew D. Kern. S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning. PLOS Genetics, 12(3):e1005928, March 2016.

[85] Brian Charlesworth and Kavita Jain. Purifying Selection, Drift, and Reversible Mutation with Arbitrarily High Mutation Rates. Genetics, 198(4):1587–1602, December 2014.

[86] Joachim Hermisson and Pleuni S. Pennings. Soft sweeps and beyond: Understanding the patterns and probabilities of selection footprints under rapid adaptation. Methods in Ecology and Evolution, 8(6):700–716, 2017.

[87] C.C. Paige, P.H. Styan, and P.G. Wachter. Computation of the stationary distribution of a Markov Chain. Journal of Statistical Computation and Simulation, 4:173–186, 1975.

[88] Motoo Kimura. The Neutral Theory of Molecular Evolution. Cambridge University Press, October 1983.

[89] Motoo Kimura and James F. Crow. The Measurement of Effective Population Number. Evolution, 17(3):279–288, 1963.

[90] Brian Charlesworth and Deborah Charlesworth. Elements of Evolutionary Genetics. W. H. Freeman, Greenwood Village, Colo, 1st edition, February 2010.

[91] Igor M. Rouzine, John M. Coffin, and Leor S. Weinberger. Perspective Fifteen Years Later: Hard and Soft Selection Sweeps Confirm a Large Population Number for HIV In Vivo. PLOS Genetics, 2014.

[92] Austin L. Hughes and Mary Ann K. Hughes. More Effective Purifying Selection on RNA Viruses than in DNA Viruses. Gene, 404(1-2):117, December 2007.

[93] Claus O. Wilke. The Speed of Adaptation in Large Asexual Populations. Genetics, 167(4):2045–2053, August 2004.

[94] J. Maynard Smith. What Determines the Rate of Evolution? The American Naturalist, 110(973):331–338, May 1976.

[95] Walter Messier and Caro-Beth Stewart. Episodic adaptive evolution of primate lysozymes. Nature, 385(6612):151–154, January 1997.

[96] W. Harrod and R. Plemmons. Comparison of Some Direct Methods for Computing Stationary Distributions of Markov Chains. SIAM Journal on Scientific and Statistical Computing, 5(2):453–469, June 1984.

[97] C Cannings. The latent roots of certain Markov chains arising in genetics: A new approach. Advances in Applied Probability, 6:260–290, 1974.

[98] Yixuan Qiu. Spectra - Sparse Eigenvalue Computation Toolkit as a Redesigned ARPACK, 2019.

[99] J. B. S. Haldane. A Mathematical Theory of Natural and Artificial Selection, Part V: Selection and Mutation. Mathematical Proceedings of the Cambridge Philosophical Society, 23(7):838–844, July 1927.

[100] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982.

[101] Laura Kubatko. The Multispecies Coalescent. In Handbook of Statistical Genomics, pages 219–246. John Wiley & Sons, Ltd, 2019.

[102] Marcel F. Neuts. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. Dover Publications, New York, revised edition, January 1995.

[103] Motoo Kimura and Tomoko Ohta. The Average Number of Generations until Fixation of a Mutant Gene in a Finite Population. Genetics, 61(3):763–771, March 1969.

[104] Tuğrul Dayar. On Moments of Discrete Phase-Type Distributions. In Formal Techniques for Computer Systems and Business Processes, volume 3670, pages 51–63. European Performance Engineering Workshop Proceedings, November 2005.

[105] P. A. P. Moran. The Statistical Processes of Evolutionary Theory. Clarendon Press, Oxford, 1962. OCLC: 307119.

[106] David S. Lawrie and Dmitri A. Petrov. Comparative population genomics: Power and principles for the inference of functionality. Trends in Genetics, 30(4):133–139, April 2014.

[107] Paula Tataru, Thomas Bataillon, and Asger Hobolth. Inference under a Wright-Fisher model using an accurate beta approximation. bioRxiv, page 021261, June 2015.

[108] Asger Hobolth and Jukka Siren. The multivariate Wright–Fisher process with mutation: Moment-based analysis and inference using a hierarchical Beta model. Theoretical Population Biology, 108:36–50, April 2016.

[109] Nicola De Maio, Dominik Schrempf, and Carolin Kosiol. PoMo: An Allele Frequency-Based Approach for Species Tree Estimation. Systematic Biology, 64(6):1018–1031, January 2015.

[110] Sarah Otto and Troy Day. A Biologist’s Guide to Mathematical Modeling in Ecology and Evolution. Princeton University Press, 2007.

[111] R. C. Griffiths and P. Marjoram. Ancestral Inference from Samples of DNA Sequences with Recombination. Journal of Computational Biology, 3(4):479–502, January 1996.

[112] Matthew D. Rasmussen, Melissa J. Hubisz, Ilan Gronau, and Adam Siepel. Genome-Wide Inference of Ancestral Recombination Graphs. PLoS Genetics, 10(5), May 2014.

[113] Dominic Nelson, Jerome Kelleher, Aaron P. Ragsdale, Gil McVean, and Simon Gravel. Coupling Wright-Fisher and coalescent dynamics for realistic simulation of population-scale datasets. bioRxiv, page 674440, June 2019.