Diversity in Evolving Systems: Scaling and Dynamics of Genealogical Trees by Erik Rauch B.S., Computer Science and Mathematics Yale University (1996) S.M., Electrical Engineering and Computer Science Massachusetts Institute of Technology (1999) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

January 2004 © Erik Rauch, 2004. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part. Author ...... Department of Electrical Engineering and Computer Science January 30, 2004 Certified by...... Gerald Jay Sussman Matsushita Professor of Electrical Engineering Thesis Supervisor Accepted by...... Arthur C. Smith Chairman, Department Committee on Graduate Students


Diversity in Evolving Systems: Scaling and Dynamics of Genealogical Trees

by Erik Rauch

Submitted to the Department of Electrical Engineering and Computer Science on January 30, 2004 in partial fulfillment of the degree of

Doctor of Philosophy

Abstract

Diversity is a fundamental property of all evolving systems. This thesis examines spatial and temporal patterns of diversity. The systems I will study consist of a population of individuals, each with a potentially unique state, together with a dynamics consisting of copying of individual states with small modifications to them (innovations). I show that properties of diversity can be understood by modelling the evolving genealogical tree of the population. This formulation is general enough that it captures interesting features of a range of natural and artificial systems, though I will pay particular attention to genetic diversity in biological populations, and discuss the implications of the results for conservation. I show that diversity is unevenly distributed in populations, and a disproportionate fraction is found in small sub-populations. The evolution of diversity is a dynamic process, and I show that large fluctuations in diversity can result purely from the internal dynamics of the population, and not necessarily from external causes. I also show how diversity is affected by the structure of the population (spatial or well-mixed), and determine the scaling of diversity with habitat area in spatial systems. Predictions from the model agree with existing experimental genetic data on global populations of bacteria. I then apply the method of modelling the genealogical tree of a population to further questions in evolution. Using a generic model of a pathogen evolving to coexist with a population of hosts, I show that the evolutionary dynamics of the system can be better understood by considering the dynamics of strains (groups of individuals descended from a common ancestor) rather than individuals. A fundamental question in the study of evolution is how selection can operate above the level of the individual, and these results suggest a more general mechanism for such selection.

Thesis supervisor: Gerald Jay Sussman Title: Matsushita Professor of Electrical Engineering

Acknowledgements

Yaneer Bar-Yam acted as a co-supervisor for this work and contributed especially to the presentation of the results and their relevance. Hiroki Sayama contributed to the work on host-pathogen evolution. Gerald Jay Sussman provided valuable guidance and strongly encouraged this work. Stephen Hubbell, James Tiedje, John Wakeley, Simon Levin, Mehran Kardar and Stuart Pimm provided useful comments on the diversity results. Jae-Chang Cho and James Tiedje provided the original figure with data on Pseudomonas bacteria populations. Charles Goodnight provided valuable comments on the host-pathogen evolution papers. Daniel Rothman and Joshua Weitz organized the Theoretical Ecology seminar which led to the genesis of the work on host-pathogen evolution.

Table of contents

Overview

1 Scaling, dynamics and distribution of diversity

2 Within-species diversity - analytic and simulation results and comparison with experimental data

3 Details of comparison with experimental results

4 Dynamics and genealogy of strains in spatially extended host-pathogen models

5 Long-range interactions and evolutionary stability in a predator-prey system

6 Related work

7 Conclusion and future work

Overview

The mechanisms that give rise to the enormous variety found in natural systems are of great inherent interest. What causes the enormous variation we see in nature? Why are some species and environments highly diverse, and others less so? How can we characterize the diversity that exists? How does diversity affect evolution? These questions have been studied since Darwin's Origin of Species and before, but this thesis presents a new approach to aspects of this problem based on analyzing and simulating properties of evolving genealogical trees of populations. Chapter 1 gives an overview of the results on diversity. Chapter 2 presents the results covered in Chapter 1 in detail, and gives additional results. I also compare these results with experimental genetic data on microbial populations, showing that the distribution of diversity within populations and a fundamental property relating to the shape of genealogical trees both match the data. Chapter 3 details this comparison. In chapter 4 I apply the method of dynamically tracing the genealogical tree of a population to further questions in evolution. In many systems, organisms modify their environment, which in turn affects the evolution of the organisms, but the effects of this are not yet well understood. I show that such systems can be better understood by considering evolution as the dynamics of strains (groups of genealogically related organisms) rather than individuals. In chapter 5 I use the methods of chapter 4 to explore the effect of local and long-range interactions on such evolutionary systems. Chapter 6 reviews existing work related to this thesis, and chapter 7 concludes and presents potential future work.

Chapter 1: Scaling, dynamics and distribution of diversity

Abstract

Here we introduce the method of modelling the evolving genealogical tree of a population, and show how it can be used to study spatial and temporal patterns of diversity. These results are given in more detail in Chapter 2.

Why diversity is important

Evolution is the phenomenon of a population of interacting individuals changing over time. We usually think of evolution as happening in biological systems, but the concept of evolution can be applied more broadly to other complex systems as well. These systems have the property that they are made up of individual elements, each with possibly unique characteristics, with new elements arising or replacing old elements in the following way: some other element or elements are copied, but occasionally small changes are introduced into the copy. The small changes can be thought of as potential innovations. The more times an element is copied, the more different it becomes from the original. For example, in a biological context, the small changes are mutations and the copying process is reproduction with inheritance. (The 'copy' does not have to be of a single individual; in sexually reproducing organisms, it is a combination of two individuals). There are several ways for interaction to take place: it can occur directly between individuals (for example, through predator-prey interaction), but more commonly it takes place through the environment. In the simplest way to account for this, individuals can be thought of as "replacing" others because limited resources support a finite population.

A population evolving according to this process explores the space of possible states that the individuals can have. A fundamental way to characterize such a population is its diversity - a measure of how much of the state space it covers. Diversity is fundamental to adaptation, which is one of the most important characteristics of evolving systems. In complex systems, the individuals are generally coupled to a complex environment. To be suited to a complex environment, the individuals' state space must be large. To discover adaptive parts of this state space, it is more effective if the population explores it in parallel, that is, if it is spread out over a region of the state space rather than being concentrated in one part of it. The more variation exists in a population, the faster it can change[1].

Furthermore, complex environments are usually dynamic, and in order to remain adapted, the population must change in response to changes of the environment. This requires that it be flexible (that is, capable of a wide range of responses), because these new conditions may never have been experienced by the system before. However, populations generally cannot change all at once. Any 'solution' must usually start as an innovation in a single individual or a small number of individuals, and adaptation at the level of the population happens when these innovations spread. Potential solutions could conceivably be generated by modifying existing ones when they are needed; however, when the state space is large, it generally takes a long time to discover an adaptive solution from scratch - possibly too much time for the response to happen quickly enough. Although in the immune system, variation is generated only when it is needed, evidence from [2] suggests that response to change is usually more effective when there is an existing pool of variation from which adaptive innovations can be drawn. This implies that effective adaptation to change also depends on diversity.

Model of diversity in evolving systems

I will now choose a simple state space for individuals in order to illustrate how the properties of diversity can be obtained from the genealogical tree of a population. It is chosen only for concreteness, and this method does not depend strongly on what the state space is as long as it satisfies a certain basic property I will give. Consider the state of an individual to be represented by a string of numbers, each standing for a characteristic (trait) which can change separately from the others. The string of numbers could represent a genome, for example. For simplicity, I will restrict each to be a 0 or 1. Mutation occurs when a single bit is flipped during the copying process. In the next generation, this new, mutated individual can also be copied, and further mutations can be introduced (Fig. 1). In this way, the descendants of a single individual become successively more different from each other. To measure the difference between two individuals, we need a distance metric. In the case of bit strings, we can use the Hamming distance - the distance between A and B is the number of one-bit changes needed to transform A into B. (This is similar to the measures used in genetics.) If we make the assumption that mutations are random and happen at a constant rate, then every link from parent to offspring represents a chance for a mutation. Therefore, the expected distance between two individuals is proportional to the number of links traced back until their common ancestor is reached.

Figure 1: When small changes can be introduced on copying, the descendants of an individual become successively more different from each other. A line represents a parent-offspring relationship, with time going down the page; there is one mutation per generation.

Diversity is a measure of how much potential variation is actually found in the population - that is, the amount of state space covered by the population. There are several ways to measure this. One can simply count the number of different types in the population. This is the measure most often used in studying biodiversity at the species level: one simply counts the number of different species represented. However, descendants that accumulate changes become successively more different from their ancestors and relatives over time, and counting the number of distinct types does not account for this. A number of individuals that are very similar to each other (close together in state space) would be measured as having equal diversity to the same number that are spread out more over the space. This can be seen in Figure 1, where the additional diversity caused by additional mutations in the second generation is not counted by this measure. In order to account for differing distances between individuals, the diversity measure should have the following property: any mutation that arises that is not already found in any member of the population should increase the diversity. A measure that satisfies this criterion is the number of positions at which both '0' and '1' are represented in the population. This is similar to measures used in genetics[3].
Just as we did for two individuals, we can trace back the ancestry of a whole population. This produces a genealogical tree, and each parent-offspring link in this tree is a chance for a mutation. Therefore, the expected diversity of the population is proportional to the number of links (the total branch length B) of the tree, assuming each mutation is different. In fact, assuming that mutations happen at a constant rate, we can ignore the individual states themselves entirely, since what we are interested in is captured by the genealogical tree. From here on, I will no longer consider the underlying state of the individuals, as the properties we are interested in can be obtained from the genealogical tree itself. The results will thus apply to any state space and form of mutation for which the distance between an ancestor and its descendants grows linearly with the number of generations. (If the state space is small enough relative to the mutation rate, the same mutation may appear independently in different individuals. We should not count these duplicate mutations in the diversity, so the total diversity will saturate as we consider larger and larger trees - the larger the tree, the greater the chance that the same mutation is found more than once in the tree. This method also applies in these cases, however, because there is a simple way to take this into account - Figure 2 shows the relationship between the branch length and the actual diversity).

Figure 2: The total diversity D of a population as a function of the total length B of the branches of its genealogical tree. Two cases are compared: one in which the state space (10^7 bits) is large enough compared to the mutation rate (μ = 1 mutation per generation) that diversity is roughly linear, and one with a smaller state space (10^6 bits), where identical mutations tend to happen in different parts of the tree when the tree is large enough. Since we do not count these duplicate mutations in the diversity, the rate of increase of D with B slows down.

In order to determine how the diversity evolves, we need to know which individuals will reproduce, and which ones will be removed or replaced. In a real system, this may depend on the characteristics of the individual, some being more adapted than others. In evolutionary biology, this dependence is called selection. However, much work in mathematical genetics has shown that many important properties of diversity can be understood by assuming that the reproduction process does not depend on the individuals' state. Using this assumption allows us to use simple models to predict properties of the diversity of populations. In the next chapter, I will show ways in which selection can be added.
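The bit-string model and the diversity measure described above can be illustrated with a short Python sketch. The function names are hypothetical, and the saturation formula is an assumption made here for illustration (each mutation lands on a uniformly random position, and only distinct mutated positions are counted); the text states only that a simple correction exists, not its form.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def segregating_sites(population):
    """Diversity: number of positions where both '0' and '1' occur."""
    return sum(1 for column in zip(*population) if len(set(column)) > 1)


def expected_distinct_mutations(branch_length, n_bits, mu=1.0):
    """Expected number of distinct mutated positions.

    Assumption (not from the text): mu * branch_length mutations each hit
    a uniformly random position among n_bits, and duplicates count once.
    """
    n_mutations = mu * branch_length
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_mutations)


# A founder and two descendants; the second descendant carries an extra mutation.
population = ["0000", "0100", "0101"]
n_types = len(set(population))          # counting distinct types
diversity = segregating_sites(population)
```

Counting types treats all three individuals as equally different, while the segregating-site count grows with every new mutation, which is the property the diversity measure is required to have.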

I will add random reproduction to the model as follows. At every time step (generation), all the individuals in the population are replaced, and each new individual is the offspring of an individual chosen randomly from a subset of the previous generation. An important consideration is where an innovation can spread once it arises. Many biological systems tend towards local dispersal. For microbes in the soil, most plants, and many terrestrial animals, offspring are generally located near their parents. Long-range transmission is possible, but rare. Other biological systems combine local and long-range dispersal. For example, most spores land close to the organism that produced them, but they can be carried far away by wind or water. Marine organisms also show this mixture of local and long range, with mixing in the aquatic environment acting against locality. This important aspect can be captured in the model as follows. We can think of the population as having a fixed number A of "places" (sites), each of which is occupied by an individual. A copy can spread from one site only to others that are connected to it. These connections form a network; at one extreme, when innovations can spread anywhere (that is, when the population is well-mixed), we have a fully-connected network (Fig. 3a). At the other extreme, an innovation can be transmitted only locally; this connectivity can be modelled using a lattice (Fig. 3b). In this case, I will refer to the number of sites as "area." Intermediate connectivities can be modelled using a small-world network[4], which is a lattice with some long-range connections. In all cases, the individual at a site is the offspring of a random individual from the previous generation located at one of the sites it is connected to.
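One generation of this reproduction rule can be sketched in Python as follows. This is a minimal illustration, not the thesis code: the function names are hypothetical, and the lattice neighbourhood (the site itself plus its four nearest neighbours) and the periodic boundaries are assumptions, since the text specifies only that a parent is drawn from the connected sites.

```python
import random


def well_mixed_parents(n_sites, rng):
    """Fully connected network: each site's parent is any site, chosen uniformly."""
    return [rng.randrange(n_sites) for _ in range(n_sites)]


def lattice_parents(side, rng):
    """side x side lattice: the parent is the site itself or a nearest neighbour.

    Assumptions: 4-neighbour connectivity and periodic (wrap-around) boundaries.
    Sites are numbered row by row, so site = row * side + col.
    """
    offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    parents = []
    for site in range(side * side):
        row, col = divmod(site, side)
        d_row, d_col = rng.choice(offsets)
        parent_row = (row + d_row) % side
        parent_col = (col + d_col) % side
        parents.append(parent_row * side + parent_col)
    return parents


rng = random.Random(0)                 # seeded so a run is reproducible
one_generation = lattice_parents(5, rng)
```

Each returned list records, for every site, which site of the previous generation its occupant descends from; a sequence of such lists is all that is needed to reconstruct the genealogical tree.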

Evolution of the total diversity

Now that we have specified the reproduction dynamics, we can model how the diversity of a population changes over time by determining how the genealogical tree changes. New diversity continually arises when new copies, which are chances for mutations, are made. On the other hand, when an individual is removed or replaced, any of its mutations that are not shared with other individuals are lost (Fig. 4). The change in diversity is governed by the balance between the generation of variation and its extinction. When we start from a homogeneous population (or a single founder), the generation of new variants exceeds the rate at which diversity is lost through extinction, so the diversity increases over time. However, the rate at which diversity accumulates slows down (Fig. 5). Eventually, a balance between the increases and decreases is reached. The long-term average diversity, which is determined by this balance, can be thought of as the capacity for diversity of the system. This balance can be upset by perturbing the population. For example, part of the population can be killed off, or the population may be replaced by the descendants of one highly adapted individual in a short time. However, after the perturbation, the population will return to the diversity capacity.

Figure 3: Model of random reproduction in a population. (a) A well-mixed population: each individual is descended from a random individual of the previous generation. (b) Example genealogical tree for a one-dimensional population. The bottom row represents the currently living population. The tree corresponding to the ancestry of the currently living individuals is shown as solid lines; the ancestry of those that have no descendants in the present, and thus do not contribute to diversity, is shown as dashed lines. At the arrow, a lineage goes extinct, causing the loss of any accumulated differences that have arisen on the line of descent from A, the most recent ancestor that has descendants in the present.

Figure 4: Gains and losses of diversity. After one generation, individual A has had one offspring (adding one unit of branch length), B has had two (adding two units), and D had one (adding one). C had no offspring in the current generation, so the two units that have accumulated since it diverged from the rest of the population are lost. Total branch length increases by two units.

Figure 5: Increase in diversity of a two-dimensional population with initially low diversity, before it reaches its long-term average. The dotted line corresponds to the analytic result B(t) ~ A (log t)^2.

The capacity depends on two things. First, the size of the population is very important, since the larger it is, the more opportunities for novelty there are. Second, diversity depends on the structure of the population (here modelled as the connectivity of the network). In general, the further offspring can be from their parents, the lower the diversity. Smaller dispersal distances or barriers increase diversity. This can be seen from Figure 6.

Figure 6 also shows that the diversity of a spatial population, in addition to being higher than that of a well-mixed population, grows faster with area. For example, at N = 100, the population has 6.3 times the diversity it would have if it were well-mixed, but at N = 2500, this has grown to 9.1 times. The ratio depends on the specifics of the model, but we can quantify, in a more general and robust way, the effect of a property on diversity by expressing how diversity scales with that property. For a well-mixed population, diversity scales as A log A[3]. For a two-dimensional population (Figure 6) it scales as A (log A)^2. Though the logarithmic factor is not a dramatic difference from the well-mixed case, the difference can grow appreciably with population size, as the example shows. For a one-dimensional population (as illustrated in Figure 3b), the difference is much larger: the branch length scales as A^2. (There are several kinds of biological populations that have effectively one-dimensional habitats, such as those that live along coasts or rivers). These scaling results are not affected by the details of the model, such as how dispersal happens or whether multiple individuals can exist on a site.

Figure 6: Average branch length B of the genealogical tree of a two-dimensional population simulated for 500,000 generations (squares) as a function of number of sites A (which can be interpreted as habitat area), compared to the analytic result B = A (log A)^2. Also shown is the branch length for a well-mixed population (circles) compared to the analytic result B = A log A.

If properties of the system change, so will the diversity capacity. A decrease in population size will, of course, decrease the diversity. However, decreases do not necessarily happen because of the removal of individuals. For example, the connectivity may change, by introducing long-range dispersal to a population which is locally connected. At the time of the change itself, diversity is not affected immediately, but decreases to the new capacity over time.
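The total branch length B discussed in this section can be measured directly from a recorded history of parent choices. The following Python sketch is illustrative only: `branch_length` and `simulate_well_mixed` are hypothetical names, and the history layout (one list of parent sites per generation) is an assumption. It counts one link per surviving lineage per generation, tracing back until all living individuals share a single ancestor.

```python
import random


def branch_length(history):
    """Total number of parent-offspring links in the tree of the living population.

    history[t][i] is the parent site (in generation t) of the individual at
    site i in generation t + 1; the last entry leads to the living population.
    Returns (B, depth), where depth is the number of generations back to the
    most recent common ancestor, or None if it lies beyond the recorded history.
    """
    lineages = set(range(len(history[-1])))
    total = 0
    for depth, parents in enumerate(reversed(history), start=1):
        total += len(lineages)                 # one link per surviving lineage
        lineages = {parents[i] for i in lineages}
        if len(lineages) == 1:
            return total, depth
    return total, None


def simulate_well_mixed(n_sites, generations, seed=0):
    """Parent choices for a well-mixed population (uniform random parents)."""
    rng = random.Random(seed)
    return [[rng.randrange(n_sites) for _ in range(n_sites)]
            for _ in range(generations)]


# Hand-checkable example with three sites and two generations.
tiny = [[0, 0, 1], [0, 1, 1]]
tiny_b, tiny_depth = branch_length(tiny)       # (5, 2)
```

Averaging B over many generations of a long run, and repeating for several values of A and for different connectivities, is one way to compare simulated branch lengths with the scaling forms quoted above; the sketch itself makes no claim about which form a given run will follow.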

Diversity is unevenly distributed

Up to this point, we have been concerned with the total diversity of a population. When we look at the genealogical tree of a simulated population (Fig. 7), we see that there are branches containing a minority of the population that diverged early from the rest of the population, such as branch A in Fig. 7. Within each particular branch, the same pattern can be seen, such as in the two subgroups of branch D. Since these groups have been separate for a long time, they have had time to accumulate mutations that make them distinct. Thus they will contain a disproportionate share of the total diversity in the population. This uneven distribution is a typical characteristic of populations. We can quantify this by considering the uniqueness of a group. This is the number of generations since its common ancestor with its most closely related group, which is proportional to how many mutations it has that are not shared with any other group. A particular individual is a member of multiple overlapping groups, so we define a group to consist of all individuals whose genealogical distance is less than Tg from the others. (This corresponds to drawing a horizontal line on Fig. 7 at a height of Tg generations, and taking each of the distinct subtrees below the line to be a group). The uniqueness has a power law distribution, shown in Fig. 8. The important features of such a distribution are that very large values can occur (the distribution has a "long tail"), and that a disproportionate fraction of the total diversity is contained in a small fraction of the population. The same distribution is found whether the population is well-mixed or spatial. Thus, the uneven distribution of diversity is a general property of such systems.
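For single individuals, the uniqueness defined above can be computed by tracing each living individual's lineage back until it first meets the lineage of any other living individual. The Python sketch below is a minimal illustration under assumptions: the function name and the history layout (one list of parent sites per generation, oldest first) are hypothetical, not taken from the thesis.

```python
def uniqueness(history):
    """Generations back to the first ancestor shared with another living individual.

    history[t][i] is the parent site (in generation t) of the individual at
    site i in generation t + 1. Returns one value per living individual, or
    None where no shared ancestor is found within the recorded history.
    """
    n_living = len(history[-1])
    position = list(range(n_living))   # site currently occupied by each lineage
    result = [None] * n_living
    for depth, parents in enumerate(reversed(history), start=1):
        position = [parents[p] for p in position]
        counts = {}
        for p in position:
            counts[p] = counts.get(p, 0) + 1
        for i, p in enumerate(position):
            if result[i] is None and counts[p] > 1:
                result[i] = depth      # lineage i has just merged with another
        if all(r is not None for r in result):
            break
    return result


# Three sites, two generations: individuals 1 and 2 are siblings,
# individual 0 joins them one generation earlier.
example = uniqueness([[0, 0, 1], [0, 1, 1]])   # [2, 1, 1]
```

A histogram of these values over a long simulation gives an empirical uniqueness distribution that can be set beside the power law described in the text.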


Figure 7: An illustration of the uneven distribution of diversity: the genealogical tree for a two-dimensional population of 130 individuals. Group A diverged from the rest of the population early and so has had more time to accumulate unique mutations. It carries more of the population's total diversity than any one of groups B, C, or D, and so it accounts for a disproportionate fraction of the diversity. The most recent common ancestor is 472 generations ago.

Figure 8: The distribution of diversity in populations. The plot shows the distribution P(u) of genetic uniqueness of individuals, in a well-mixed (small symbols) and a two-dimensional population (larger symbols). The horizontal axis represents uniqueness u in generations of divergence. The distribution can be fit by a power law with exponent -2.8 (dotted line). The distribution for Tg = 1 (that is, single individuals) is shown, but the same distribution also applies to groups with Tg > 1. Also shown is the distribution of the sizes of losses of branch length in a single time step. Because essentially randomly selected individuals die off at each time step, this has the same underlying distribution as uniqueness. The flattening of the curve for small losses reflects the effect of averaging several individuals that die off in the same time step. Data is logarithmically binned.

Figure 9: The effect of inheriting from multiple parents on the distribution of uniqueness in a population. There are assumed to be g separately inherited parts of the individual's state. For simplicity, each is assumed to be inherited from a random neighboring parent. (The effect of sexual reproduction would be somewhat less than shown here, since each unit can be inherited from one of only two parents). From left to right, g = 5, 20, and 100.

When a new individual is a combination of multiple individuals (for example, in sexual reproduction), the uniqueness of the offspring is an average of the uniqueness of the contributions of the parents. This reduces the unevenness of the diversity distribution, because it is less likely that both parents are highly unique than that a single individual is highly unique. However, because of the long tail of the distribution, a large uniqueness in traits inherited from one parent is enough to make the whole individual highly unique. Fig. 9 shows that separately inheriting larger numbers of units (such as parts of chromosomes) makes the distribution tend more towards a Gaussian distribution. However, even when the number of separately inherited units is relatively high, the power law tail of the distribution remains, though it disappears when the number is very large.

Diversity undergoes large fluctuations

Every generation, some individuals die or are replaced without leaving any offspring. This means that all their unique mutations are lost. We have seen that diversity becomes concentrated into small subgroups. Such a group could become the ancestor of the whole population, but it is more likely that it will eventually die off. When these subgroups die off, the result is a sudden decrease in diversity. Fig. 10 shows that these decreases can be very large, and cause the diversity, even when the population has reached a steady state, to continually fluctuate. Diversity builds up gradually through the accumulation of mutations, but is "dissipated" in these decreases. Fig. 8 shows that the distribution of the sizes of these decreases is a power law. Because the individuals that die off without offspring are essentially randomly chosen, this distribution is essentially the same as the uniqueness distribution of individuals. Thus, large fluctuations in diversity happen even when the population is not perturbed by external forces. Since the decreases have a power law distribution, they have no characteristic time scale or size. The time between losses of diversity of at least u scales as u^2. In the simulation, 10% of the diversity is lost in a single generation every 3,700 generations on average; a quarter is lost every 12,600 generations, and half the diversity is lost every 218,000 generations.
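Statistics of this kind can be extracted from a recorded time series of branch length. The Python sketch below shows one possible way to do so; the function names are hypothetical, and measuring a loss as a fraction of the diversity present just before it is an assumption about how the quoted percentages are defined.

```python
def losses(series):
    """Sizes of all single-generation decreases in a diversity time series."""
    return [before - after
            for before, after in zip(series, series[1:])
            if after < before]


def mean_interval(series, fraction):
    """Average number of generations between large single-step losses.

    A loss counts if it removes at least `fraction` of the diversity present
    just before it (assumption). Returns None with fewer than two such losses.
    """
    times = [t for t, (before, after) in enumerate(zip(series, series[1:]))
             if before > 0 and (before - after) >= fraction * before]
    if len(times) < 2:
        return None
    return (times[-1] - times[0]) / (len(times) - 1)


# Toy series: slow build-up punctuated by two sudden losses.
toy = [100, 110, 60, 70, 80, 30, 40]
toy_losses = losses(toy)                  # [50, 50]
toy_interval = mean_interval(toy, 0.25)   # 3.0
```

Applied to a long steady-state run, the list returned by `losses` can be binned logarithmically to examine the shape of the loss-size distribution.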

Spatial distribution of diversity

Another way to look at the distribution of diversity is how it is distributed over space. When reproduction is local, nearby individuals tend to be related to each other, and as Fig. 11 shows, subpopulations (corresponding to branches in Fig. 7) are concentrated in particular regions. This leads to boundaries between divergent subpopulations. An observer might be tempted to conclude that there are differences in the local conditions on either side of the boundary which caused the population to diverge because the subpopulations on each side have specialized, or that the boundary is due to a barrier to dispersal. Alternatively, an observer might conclude that the boundary is there because one of the subpopulations migrated from somewhere else. However, these boundaries are a consequence of the local reproduction process only, and do not require selection or unusual historical events in order to form. When inheritance is from multiple parents, each set of traits inherited separately has its own such pattern of diversity.

Figure 10: Diversity in populations, even when they have reached a steady state, undergoes large fluctuations. The plot shows a time series of the diversity of a spatially distributed population. The diversification phase, during which the population approaches the diversity capacity, lasts roughly until time t = 20,000 generations. Thereafter, the average remains constant but there are losses of all sizes up to about 2/3 of the total branch length. The distribution of these losses is shown in Fig. 8. The inset shows that the time series of a well-mixed population is similar.

Figure 11: Simulated spatial pattern of diversity in a two-dimensional space (represented as a lattice of l x l sites with l = 50). Inheritance is from one parent. In the left panel, the individuals shown in gray or black represent a genealogically divergent group whose most recent ancestor in common with the rest of the population (white) lived about 10,000 generations ago. Gray and black represent related branches that diverged about 2,000 generations ago. The four panels at right show the most divergent groups in the later evolution of the same population (in black, subgroups not distinguished), every 2,000 generations starting at t = 20,000.

References

[1] Fisher, R. A. The genetical theory of natural selection (Oxford University Press, London, 1930).

[2] Frankham, R., Lees, K., Montgomery, M. E., England, P. R., Lowe, E. H. & Briscoe, D. A. Do population size bottlenecks reduce evolutionary potential? Anim. Cons. 2, 255-260 (1999).

[3] Watterson, G. A. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7, 256 (1975).

[4] Watts, D. J. & Strogatz, S. H. Collective dynamics of "small-world" networks. Nature 393, 440-442 (1998).

Chapter 2: Within-species diversity - analytic and simulation results and comparison with experimental data

Abstract

In this chapter we give analytic arguments and simulations supporting the results stated in the previous chapter, and investigate additional properties of diversity. We discuss the relevance of our findings to biological systems, and show that two predictions from the model - the number of ancestors of the living population at a given time in the past, and the distribution of uniqueness - agree with existing experimental data on global samples of Pseudomonas bacteria.

The study of diversity is central to biology, and the mechanisms that give rise to the enormous variety found in nature are of great inherent interest. Understanding diversity also has practical value for the conservation of biological resources [2] since it is important to the resistance of a population to disease and its adaptability to environmental changes [3,4]. DNA sequencing is increasingly being used to probe the genetic structure within species, providing opportunities for comparison with experiment. The genetic structure of populations has been studied using coalescent theory and related methods [5-9], including studies of subdivided and continuous populations [10, 11], and simulations have been used to study particular models of spatial populations [12].

Here, we focus on the scaling dependence of population properties - including diversity, its distribution in the population, and the sizes of losses of diversity - on key variables such as area and time. The scaling behavior is robust to many variations in the model, and provides predictions that can be compared with experiments. We first present the model in detail and show that properties of diversity can be obtained by modelling the ancestry of the population as a coalescing random walk. We use this to derive the distribution of genetic distances between individuals in the population, showing that a spatial population has a power law distribution, which is fundamentally different from well-mixed populations, which have an exponential distribution. By deriving the scaling of diversity with habitat area, we will demonstrate that area has a stronger effect on within-species diversity than on species diversity. We will then give results for the power law distribution of genetic uniqueness and of the sizes of fluctuations in diversity.

Model. Our lattice model for the population itself is similar to the stepping stone model [13]. The essential features of spatial populations are locality and a form of competitive interaction that results in local limitation on density. A step of the simulation consists of the birth of a new generation. At each site, new individuals are born, and we identify the parent of each organism as being either a previous organism at that site, or one at one of the neighbors (Fig. 3 in Chapter 1). Local competition is included by allowing only a fixed number $n_{max}$ of individuals to exist on a given site. For simplicity, we will take $n_{max}$ to be 1 unless otherwise noted. A well-mixed simulation is similar except that each organism can be the offspring of any parent in the previous generation (the Wright-Fisher model [14, 15]); the population remains of constant size N. Mutation will not be included in the model directly; instead, we can superimpose assumptions about mutation on the properties of genealogical trees to obtain genetic diversity. We will first make the assumption of neutrality (that natural selection plays a negligible role); thus, the individual at any site is equally likely to be the offspring of an individual at any connected site in the previous generation. We will then characterize conditions under which selection alters the results. We will also first treat the case where individuals have only one parent, which applies to genes, organelle DNA, and asexual organisms, and then use these results to obtain approximate results for sexual populations.
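As an illustration of the model just described, the following is a minimal sketch of one generation step, written in Python. It is not the simulation code used for the results in this chapter. The function names, the choice of four nearest neighbours and the periodic boundaries are assumptions made for the example; the text only specifies one individual per site and a parent drawn with equal probability from the connected sites.

```python
import random

def neighbours(x, y, size):
    """Sites connected to (x, y): the site itself and its four nearest
    neighbours. Periodic boundaries are an assumption of this sketch."""
    return [
        (x, y),
        ((x + 1) % size, y),
        ((x - 1) % size, y),
        (x, (y + 1) % size),
        (x, (y - 1) % size),
    ]

def next_generation_parents(size, rng):
    """For every site, choose the site that held the parent of its new occupant.

    Neutrality: each connected site is equally likely to hold the parent.
    One individual per site (n_max = 1).
    """
    return {
        (x, y): rng.choice(neighbours(x, y, size))
        for x in range(size)
        for y in range(size)
    }

def well_mixed_parents(n, rng):
    """Wright-Fisher step: each of the n offspring picks any parent uniformly."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)
spatial = next_generation_parents(5, rng)
mixed = well_mixed_parents(25, rng)
print(len(spatial), len(mixed))
```

Recording these parent assignments generation after generation is enough to reconstruct the genealogical tree later, since mutation is superimposed on the tree rather than simulated directly.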

Genetic distances between individuals. Consider an individual on the lattice at time T = 0 (the present); we will use T to stand for the number of generations before the present. The sequence of its ancestors can be traced backward in time, with the parent always located at one of the lattice sites connected to that of its offspring. The sequence of locations of the ancestors of an individual is thus a random walk. Two such walks stepping onto the same site correspond to two individuals having a common parent; so, for two individuals $x_1$ and $x_2$ at different locations on the lattice, the time until their common ancestor is distributed as the first intersection time of two random walkers. (Other assumptions such as different dispersal distances or migration rates, lattice geometry, overlapping generations and number of occupants per site do not change the scaling properties.) If mutations are random, the number of mutations occurring since the lineages of $x_1$ and $x_2$ diverged is binomially distributed with mean proportional to the number of generations.
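The correspondence between common ancestry and intersecting random walks can be sketched in a few lines of Python. This is an illustrative sketch only: it uses a one-dimensional ring for brevity, and the function name, the step set (stay, left, right) and the `max_steps` cut-off are assumptions of the example rather than details taken from the thesis simulations.

```python
import random

def pair_coalescence_time(start_a, start_b, size, rng, max_steps=10**6):
    """Generations back until two lineages on a ring of `size` sites
    step onto the same site, i.e. share a parent.

    Returns None if they have not met within max_steps generations.
    """
    a, b = start_a, start_b
    for t in range(1, max_steps + 1):
        # Each lineage moves to the site of its parent: the same site
        # or one of the two neighbouring sites, with equal probability.
        a = (a + rng.choice((-1, 0, 1))) % size
        b = (b + rng.choice((-1, 0, 1))) % size
        if a == b:
            return t
    return None

rng = random.Random(0)
times = [pair_coalescence_time(0, 10, 20, rng) for _ in range(200)]
print(sum(times) / len(times))  # average genealogical distance for this pair
```

Collecting these times over many pairs of starting sites gives an empirical version of the pairwise distance distribution discussed next.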

[Figure 1 plot: P(T) versus T on logarithmic axes, with curves labelled 1D, 2D and well-mixed.]

Figure 1: Distribution P(T) of pairwise genealogical distances in a two-dimensional population and a one-dimensional population (L = 150), and the analytic result $P(T) = \frac{1}{N}\left(1 - \frac{1}{N}\right)^{T-1} \approx \frac{1}{N} e^{-T/N}$ for a well-mixed population (N = 150). Here and in subsequent figures, the two-dimensional lattice size L = 30 unless otherwise noted.

It is known that in spatial populations a typical pair of individuals is more distantly related to each other than in well-mixed populations [14]. In well-mixed populations, the number of pairs at a given genetic distance from each other has an exponential distribution [5]. In spatial populations, by contrast, we find that the genetic distances have a power law distribution (Fig. 1). In a well-mixed population, there is a probability $1/N$ of two individuals having the same parent. With probability $1 - 1/N$, these individuals did not have the same parent, but with probability $(1 - 1/N)(1/N)$ they had the same grandparent; and so on. The distribution $\Delta(T)$ of pairwise genealogical distances (plotted as P(T) in Fig. 1) is thus exponential: $\Delta(T) = \frac{1}{N}\left(1 - \frac{1}{N}\right)^{T-1} \approx \frac{1}{N} e^{-T/N}$. In spatial populations, the distribution of genetic distances is a consequence of the distribution of first intersection times of random walks averaged over pairs of starting points.
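As a consistency check on the well-mixed argument, the pairwise distribution is a geometric distribution with success probability 1/N per generation. The short derivation below writes it as P(T), following the notation of Fig. 1. It confirms that the distribution is normalized and that its mean is N generations; these are standard properties of the geometric series and not additional results of the model.

```latex
\begin{align}
\sum_{T=1}^{\infty} \frac{1}{N}\left(1-\frac{1}{N}\right)^{T-1}
  &= \frac{1}{N}\cdot\frac{1}{1-\left(1-\frac{1}{N}\right)} = 1, \\
\langle T \rangle
  = \sum_{T=1}^{\infty} T\,\frac{1}{N}\left(1-\frac{1}{N}\right)^{T-1}
  &= \frac{1}{N}\cdot\frac{1}{\left(1/N\right)^{2}} = N .
\end{align}
```

For large N, $(1 - 1/N)^{T-1} \approx e^{-T/N}$, which is the exponential form quoted in the text.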

Shape of the genealogical tree. Now consider the entire population of individuals on the lattice. The ancestry of this population can be described by a spatial genealogical tree, which we will use to obtain properties of diversity. Fig. 2 shows the genealogical tree of a one-dimensional population. The lineage of each member of the current population executes a random walk. Two walkers (lineages) coalesce into one when they collide. The number of lineages L(T) is the number of individuals that lived T generations ago that have descendants in the present. L(T) decreases with time into the past as individual lineages coalesce, and is equal to the number of remaining walkers in the coalescing walk. We will derive this function for well-mixed and spatial populations, compare it with experimental genetic data, and then use it to derive the scaling of diversity with habitat area.

Figure 2: Spatial genealogical tree for a one-dimensional population of 300 individuals. The y-coordinate represents time, with the present generation at the bottom. The ancestors of the current population are shown, with the x-coordinate representing their physical position. At left, the full tree is shown; the common ancestor is 9248 generations before the present. At right, the most recent 300 generations are enlarged.

The scaling of L(T) in well-mixed populations [15, 16] can be understood as follows. In a genealogical tree, two lineages coalesce if two individuals have the same parent in the previous generation. In the Wright-Fisher model of a well-mixed population, each lineage can be thought of as jumping to a random site on each time step (time being measured as number of generations before the present), with multiple lineages on the same site coalescing. The number of lineages L(T + 1) in generation T + 1 is therefore the number of distinct individuals in L(T) samples from the set of ancestors in generation T + 1 with replacement. Let p(T) = L(T)/N be the fraction of individuals in generation T that have descendants in the present. For p(T) sufficiently small (that is, when T is not small) we can neglect all but pairwise coalescence. The probability that two lineages coalesce into the previous generation is $p(T)^2$ when p(T) is not large, with a resulting change in p(T) of $\Delta p \sim -p^2/2$ (the factor of 1/2 arises because only one lineage is lost for a pairwise coalescence [17]; still, the scaling behavior depends only on the proportionality of the change to $p^2$ and not on the coefficient), which in the continuous limit is $dp/dT = -p^2/2$, giving $p(T) = 2/T$, i.e. the scaling behavior:

$$p(T) \sim \frac{1}{T}$$
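The well-mixed scaling can be checked with a short backward-in-time simulation. The sketch below is an illustration in Python, not the code used for the figures; the function name and parameter values are assumptions. It traces all N present-day lineages through a Wright-Fisher population and prints L(T) next to the estimate 2N/T implied by p(T) = 2/T.

```python
import random

def lineage_counts(n, generations, rng):
    """Trace all n present-day lineages backward in a Wright-Fisher population.

    Returns [L(0), L(1), ..., L(generations)], where L(T) is the number of
    distinct ancestors T generations before the present.
    """
    lineages = set(range(n))
    counts = [len(lineages)]
    for _ in range(generations):
        # Each lineage jumps to a uniformly random parent; lineages that
        # pick the same parent coalesce because the set keeps one copy.
        lineages = {rng.randrange(n) for _ in lineages}
        counts.append(len(lineages))
    return counts

rng = random.Random(1)
n = 2000
counts = lineage_counts(n, 200, rng)
for t in (10, 50, 100, 200):
    print(t, counts[t], round(2 * n / t))
```

The agreement is expected to be approximate and to improve once T is no longer small, since the argument in the text neglects everything but pairwise coalescence.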

In a spatial population, L(T) can be expressed as p(T)Ad, where d is the population density, A is the habitat area, and p(T) is the fraction of individuals in generation T that have descendants in the present. p(T) is the probability that a given site is occupied at time T in a coalescing random walk starting with all sites occupied. Expressions for p(T) in different dimensions have been derived [18]; in one dimension, $p(T) \sim 1/\sqrt{T}$ and in two dimensions, $p(T) \sim \log(T)/T$. Thus, the scaling behavior of L(T) with time for a two-dimensional spatial population differs from that of a well-mixed population by a logarithmic factor, which is a relatively small distinction when compared to the quite different scaling behavior of the distributions of pairwise genetic distances.
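A coalescing random walk starting with all sites occupied is straightforward to sketch. The Python example below uses a one-dimensional ring to keep it short. The function name, the step set and the lattice size are assumptions of the example, and it is meant only to show how p(T) can be measured, not to reproduce the published results.

```python
import math
import random

def coalescing_walk_1d(size, generations, rng):
    """Number of surviving lineages L(T) on a ring, starting with one per site."""
    positions = set(range(size))
    counts = [len(positions)]
    for _ in range(generations):
        # Every lineage steps to its parent's site; lineages landing on the
        # same site merge because the set keeps a single copy.
        positions = {(x + rng.choice((-1, 0, 1))) % size for x in positions}
        counts.append(len(positions))
    return counts

rng = random.Random(2)
size = 5000
counts = coalescing_walk_1d(size, 400, rng)
for t in (25, 100, 400):
    # If p(T) ~ 1/sqrt(T), this rescaled density should be roughly constant.
    print(t, counts[t] / size * math.sqrt(t))
```

Replacing the ring with a two-dimensional lattice would allow the same check against the log(T)/T form.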

Comparison with experimental data. An experimental test of this model is shown in Figure 3. L(T) was obtained from a clustering analysis of Pseudomonas bacteria [19] sampled at different sites around the world. The experimental data is long-tailed, and this tail is well approximated by both the well-mixed and spatial predictions. It is important to emphasize, however, that the original theoretical result is the result expected for the genealogical tree of the entire population rather than a limited set of samples of the population, as was obtained in the experiment. We thus performed theoretical calculations using simulations (shown in the figure) that directly represented the sampling of the population at the specific geographical locations where the Pseudomonas samples were obtained.

In the simulations, lattice cells represent a physical region of the earth. 248 lineages were placed at sites whose cells contain the latitude/longitude coordinates of the 248 samples in the Cho and Tiedje data. At each step of the simulation, moving backward in time, a lineage performs a random walk, staying in place or moving to a neighboring site. At its destination, with a certain probability $p_c$, it coalesces with other lineages at that site. $p_c$ and the number of simulation time steps $N_T$ corresponding to one unit $T_0$ of biological time, as well as the lattice size, are adjustable parameters. The parameters were set by a simple fitting procedure that adjusts the intercepts of L(T) at the L and T axes but does not affect the shape of the curve. The parameters were adjusted to simultaneously fit both L(T) and a different property of the same genealogical tree: the distribution of genetic distinctiveness, that is, the number U(T) of samples whose most closely related sample diverged from it T generations ago, as described below. No special allowances were made for earth curvature or topography (e.g. oceans) that might affect the last few points, which represent global dispersal of only a few most ancient lineages.

The result is the solid line in Fig. 3. To understand this curve, we note that the deep part of the genealogical tree (long times) corresponds directly to the scaling result given for the full population (thin line in Fig. 3). This result is consistent with the recognition that the samples taken from all over the world should be a complete representation of the deepest part of the genealogical tree.
An extrapolation of this curve to $T/T_0 = 1$ gives a rough estimate of the total diversity of the population, in terms of the number of genotypes that could be distinguished at the level of resolution of the data (a genomic similarity value r = 0.95), suggesting on the order of 1,000 genotypes. For the part of the curve corresponding to recent times, the sampling should underrepresent the number of lineages in the genealogical tree. This can be seen in the figure as values of L(T) which are lower than the scaling result for the full population for short times. The degree to which the sampling underrepresents the tree, however, depends on the spatial structure of the population.

Although a simulation of a well-mixed population also matches the data from samples taken from around the world, it does not match a spatially correlated sub-sample. The lower data points (diamonds) in Fig. 3 correspond to a sub-sample consisting of the samples taken in California. A comparison with what would be expected for a completely mixed population (dotted line in Fig. 3) shows it is inconsistent with the experimental results for the spatially correlated sampling. Therefore, the overall diversity of the California samples is lower than it would be if the population were well-mixed. A spatial simulation of the ancestry of the California samples alone, using the same parameters as for the full simulation, matches the experimental data for the California samples. This effect of correlated spatial sampling occurs only for the case of a spatially structured population, and not for a well-mixed population where local samples and random samples would be equivalent. The results confirm the importance of spatial structure to the Pseudomonas global population. The results of spatial theoretical calculations closely match the experimental data over its full range. Details of our method are given in the appendix.

[Figure 3 plot: panels (a) and (b), $L(T/T_0)$ versus $T/T_0$ on logarithmic axes.]

Figure 3: (a) Number of lineages as a function of time in the past L(T), comparing experimental data (circles) and a spatial simulation of the sampled population (solid line). The dotted line corresponds to theory for the whole population. The L-intercept of this line provides a rough estimate of the genetic diversity of the entire population, that is, on the order of 1000 genotypes that could be distinguished at the level of resolution of the data. (b) A comparison of the spatial and well-mixed cases. L(T) for a subset of the sampled population consisting of the samples from one geographic region is shown as stars, with a comparison of a spatial (dashed line) and well-mixed (dotted line) simulation of the subset. The experimental data (circles) and simulation (solid line) for the full population are as in part (a). The spatial simulation matches the data for the spatially correlated sample more closely, indicating the importance of spatial structure. Experimental data was obtained from Cho and Tiedje [19]. Time T in the past (the current generation being at T = 0) is normalized by dividing by $T_0$, the time to the smallest genetic difference considered. The number of simulation time steps $N_T$ corresponding to one unit $T_0$ of biological time is 160 and the coalescence probability $p_c$ is 0.15.

Scaling of diversity with area. Area is an important factor in biodiversity. The area of habitat available to organisms has been found to be a primary determinant of their diversity above the species level, as measured by the number of species [20, 21]. Experimental results have found that the number of species S in a sample area A scales as $S \sim A^z$ on most scales, with z typically 0.25 and ranging from 0.15 to 0.4 [20,21], and this scaling has been modelled theoretically [20,22]. We will now use the scaling of L(T) to derive the scaling of the total diversity within a population. Geographically limited dispersal has been found to increase diversity in many species [23] (including human Y-chromosome [24] and mitochondrial DNA [25]). Geographic differentiation has been found in viruses [26], bacteria [19], plants [27], trees [28], invertebrates [29], amphibians [30], mammals [31] and birds [32], both in two-dimensional and one-dimensional [33] habitats. Every parent-offspring link in the ancestry of the living population represents a chance for mutation. Therefore, diversity is a function of the total length of all branches of its spatial genealogical tree.
A key difference between species diversity, as measured by the number of species, and genetic diversity is that the former treats all species as equally distinct, not considering the degree to which species are different from each other. The measures used in this paper, by contrast, count additional mutations along a lineage that make a descendant progressively more different from its ancestor and relatives.

We can obtain the branch length B by summing the number of lineages over all generations in the past up to the expected time $T_A$ of the most recent common ancestor of the population. Any mutations that occurred before the most recent common ancestor will be shared by the whole population and will not contribute to diversity. Thus,

$$B = \sum_{T=0}^{T_A} L(T).$$

$T_A$ is the expected time at which there is only a single lineage, and is obtained by setting $L(T_A) = 1$. Any individuals that do not have descendants in the present are not counted in B, thus excluding mutations that have become extinct. Under the assumption that the same mutation is not likely to arise more than once in the ancestry of the population (sometimes known as the "infinitely many sites" assumption), the genetic diversity measured as the total number of distinct mutations is proportional to B. If this assumption does not hold, we must ensure that mutations that are found more than once in the tree are counted only once; thus the expected number of distinct mutations in the population is $D(B) = \frac{\mu}{\mu_L}\left(1 - e^{-\mu_L B}\right)$, where $\mu$ is the per-genome mutation rate and $\mu_L$ is the probability of a particular mutation arising at a particular locus. (This is akin to the Jukes-Cantor correction for estimating the divergence time between two sequences, but in reverse [34].) Fig. 2 in Chapter 1 shows that diversity is roughly proportional to branch length for a range of plausible mutation rates, but saturates for high mutation rates (or small genomes).

In two dimensions, L(T) is proportional to $A \log(T)/T$, implying that $T_A$ scales as $A \log(A)$, so the branch length scales as:

$$B(A) \sim A\left[\log(A) + \log(\log(A))\right]^2 \sim A(\log(A))^2$$

Thus, for a two-dimensional habitat, B grows somewhat faster than area, but by a relatively slowly increasing factor of $\log^2(A)$. Still, this implies a much faster increase with area for genetic diversity than has been measured for species diversity. Fig. 6 in Chapter 1 shows that this function well approximates the average diversity of a simulated population.

We can also consider an effectively one-dimensional habitat, whose topology corresponds to a number of natural habitats such as rivers and coastal or tidal zones, if the width of the habitat is not much more than the dispersal distance. In one dimension, $L(T) \sim A/\sqrt{T}$. The expected time of the most recent common ancestor is $T_A \sim A^2$, and so the branch length scales as:

$$B(A) \sim A \int_0^{T_A} \frac{dT}{\sqrt{T}} \sim A^2$$

Thus the total branch length in one dimension grows much faster than length or population size; it quadruples when length or population size is doubled [35]. This is different from well-mixed populations, whose diversity scales as $N \log N$, where N is the population size [5].

These results assume that dispersal occurs to a neighboring site. When there is longer-range dispersal (e.g. with Gaussian dispersal distances), and local populations are effectively well-mixed within the dispersal distance, the same scaling results should be expected, as long as the dispersal distance is significantly smaller than the size of the habitat. This was confirmed by simulations. When some dispersal happens at long distances, a transition between spatial and well-mixed behavior is expected.

The above has also assumed that there is no recombination; in sexual populations, the tree for different portions of the genome (such as genes and organelle genomes) may be different. However, the diversity-area relationship is essentially the same in sexual and asexual populations. To illustrate the effect of recombination, consider the following hypothetical extreme case: each individual has g separately inherited units of the genome, each inherited independently from a random neighboring individual in the previous generation without linkage. Each unit has its own genealogical tree, and total diversity is the sum of the contribution of each unit. Thus, the total is the same as in an asexual population with the same per-genome mutation rate [36].

Another measure of diversity is the average genealogical distance between individuals, sometimes known as nucleotide diversity [37]. In nonspatial and spatial populations alike, the scaling of this quantity is the same as the scaling of the time to the most recent common ancestor of the whole population. In a well-mixed population, the average genealogical distance between members of the population is $\Delta_{avg} = \int_0^{\infty} T\,\Delta(T)\,dT = N$. In a spatial population, the average genealogical distance $\Delta_{avg}$ between members of the population is the first hitting time of a pair of random walkers, averaged over all walks and pairs of locations. This quantity scales as $N^2$ in one dimension and $N \log N$ in two dimensions [38].
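The scaling forms in this section can be turned into a small worked calculation. The Python sketch below is illustrative only: prefactors are omitted, so only ratios are meaningful, and all function names are assumptions of the example. The saturation formula is written as D(B) = (mu/mu_L)(1 - exp(-mu_L B)), which is how the correction for repeated mutations is read here; treat that reading as an assumption.

```python
import math

def branch_length_2d(area):
    """Scaling form B(A) ~ A (log A)^2 for a two-dimensional habitat."""
    return area * math.log(area) ** 2

def branch_length_1d(length):
    """Scaling form B(A) ~ A^2 for a one-dimensional habitat."""
    return float(length) ** 2

def branch_length_well_mixed(n):
    """Scaling form B(N) ~ N log N for a well-mixed population."""
    return n * math.log(n)

def remaining_fraction(scaling, size, shrink=2.0):
    """Fraction of branch length left when the habitat shrinks by `shrink`."""
    return scaling(size / shrink) / scaling(size)

def expected_distinct_mutations(b, mu, mu_l):
    """Assumed form D(B) = (mu / mu_l) * (1 - exp(-mu_l * B))."""
    return (mu / mu_l) * -math.expm1(-mu_l * b)

size = 10_000
print("1D:", remaining_fraction(branch_length_1d, size))
print("2D:", remaining_fraction(branch_length_2d, size))
print("well-mixed:", remaining_fraction(branch_length_well_mixed, size))
# While mu_l * B is small, D(B) is close to mu * B; for large B it saturates.
print(expected_distinct_mutations(1_000, 0.01, 1e-9))
print(expected_distinct_mutations(1e12, 0.01, 1e-9))
```

Halving a one-dimensional habitat leaves a quarter of the branch length, which is the inverse of the quadrupling on doubling noted in the text; in two dimensions the remaining fraction falls somewhat below one half because of the logarithmic factor.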

Founding and perturbation. The dependence of diversity on habitat area or population size implies that habitats have a diversity capacity. A population in which mutation and extinction are in balance in the long term will have a long-term average diversity equal to this capacity, though the genetic makeup of the population will be constantly changing. (A similar concept applies to species richness [22].) Diversity may be lower than the capacity of the habitat due to recent founding of the population or to perturbations that kill off part of the population. A population with initially low diversity increases in diversity until its average reaches the diversity capacity. In a spatial population, we can assume that the size of the population increases to its long-term average in a time much shorter than $T_A$, the time to most recent common ancestor in a non-perturbed population, since the time to populate an unoccupied habitat scales as $A^{1/2}$ in two dimensions (given a uniform rate of spreading) whereas $T_A$ scales as $A \log(A)$. (In one dimension the values are $A$ and $A^2$ respectively.) Given this assumption, the genealogical tree looks like that of a non-perturbed population, except its more distant history is effectively "cut off" by the diversity-reducing or founding event.

[Figure 4 plot: branch length B versus time T.]

Figure 4: Increase in diversity of a one-dimensional population with initially low diversity, before it reaches its long-term average at time $T_A$. The dotted line is the analytic result $B(t) \sim A\sqrt{t}$.

Diversity at a time of t generations after the event is $B(t) \sim \int_0^t L(t')\,dt'$ for $t < T_A$, giving $B(t) \sim A\sqrt{t}$ in one dimension; the fraction of diversity recovered by time t is $F(t) = \sqrt{t}/A$. In two dimensions, $B(t) \sim A(\log(t))^2$ and the fraction $F(t) = (\log(t)/\log(A))^2$. In a well-mixed population, $B(t) \sim N\log(t)$ and $F(t) = \log(t)/\log(N)$. Fig. 4 compares these results to simulations in one dimension, and Fig. 5 in Chapter 1 does the same for two dimensions. Thus, initially the increase proceeds rapidly but it slows down with time, and continues until full recovery at time $t = T_A$. Whether events that cause the loss of most of a population's diversity affect its long-term average diversity therefore depends on the frequency of such events relative to $T_A$; if the time between events is large compared to the recovery time, they will not decrease the long-term average. Diversity may also be higher than the capacity of the habitat, as when a population is restricted in range to part of its original habitat or moved to a smaller one. Though the diversity may remain high in the short term, it will eventually decrease to the capacity of the smaller habitat.
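The recovery fractions quoted above are easy to evaluate numerically. The Python sketch below simply encodes the three scaling forms; the function names are assumptions of the example, and because prefactors are dropped the numbers should be read as indicative of how slowly recovery proceeds rather than as quantitative predictions.

```python
import math

def recovered_fraction_1d(t, area):
    """F(t) = sqrt(t) / A for a one-dimensional habitat of size A."""
    return math.sqrt(t) / area

def recovered_fraction_2d(t, area):
    """F(t) = (log t / log A)^2 for a two-dimensional habitat of area A."""
    return (math.log(t) / math.log(area)) ** 2

def recovered_fraction_well_mixed(t, n):
    """F(t) = log t / log N for a well-mixed population of size N."""
    return math.log(t) / math.log(n)

area = 100
# In one dimension full recovery (F = 1) is reached at t = A^2, matching T_A ~ A^2.
print(recovered_fraction_1d(area ** 2, area))
# A quarter of that time recovers only half of the diversity.
print(recovered_fraction_1d(area ** 2 / 4, area))
print(recovered_fraction_2d(100, 10_000))
print(recovered_fraction_well_mixed(100, 10_000))
```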

Effect of selection. Selection can impact diversity in at least two ways. Spatially varying selection (that is, local adaptation) favors different genotypes in different parts of the habitat. This causes barriers to dispersal which will tend to increase the diversity of the whole population, though it may decrease diversity in a local area. On the other hand, spatially uniform selection, in which particular mutants can have higher fitness anywhere in the habitat, can decrease diversity, since the descendants of a mutant can take over the population in a short time, thus wiping out the population's existing diversity before an equivalent amount of diversity has had time to develop in the descendants of the mutant. (This is known as periodic selection or genetic hitchhiking [39,40].) We define a periodic selection event to be one where the descendants of a mutant take over the population in a time less than the recovery time $T_A$, and denote the rate of such mutations arising in an individual as $\mu_R$. (Mutations that take on the order of $T_A$ generations are not distinguishable from neutral mutations.) Whether periodic selection affects the long-term average diversity therefore depends on the frequency of such selected mutants arising relative to the recovery time $T_A$; if the time $t_E = 1/(\mu_R N)$ between events is small compared to $T_A$, average diversity will be systematically reduced. As area increases, both N and $T_A$ increase (larger populations require longer to recover their diversity), so A may reach a size at which selection causes the rate of increase in diversity with area to slow. A first approximation for the reduced diversity when the time between events is smaller than the recovery time is $B(t = t_E)$, where B(t) is the diversity at t generations after a founding event, giving $B \sim A\left[\log\left(\frac{1}{\mu_R N}\right)\right]^2$ in two dimensions; more rigorous and comprehensive results for well-mixed populations are given in Ref. [39].

Distribution of diversity and its fluctuation. Unlike the result on relatedness, which allows us to distinguish well-mixed and spatial populations, the following results are similar for both well-mixed and spatial populations, illustrating their highly robust (universal) nature. The uniqueness u(i) of an individual, which we use to quantify how diversity is distributed within the population, is defined as the number of generations to i's most recent common ancestor that has another currently living descendant. The mutations that took place since that ancestor are not shared with any other member of the population. Since the probability distribution P(u) is a power law (Fig. 8 in Chapter 1), u has no characteristic size. Its distribution implies that a disproportionate fraction of the genetic diversity is typically contained in a small fraction of the population. This distribution can be understood as follows. The probability P(U > u) that an individual has uniqueness greater than u is the probability that its lineage, traced backward through time, never exists on a site that has another lineage for all T < u. In the well-mixed model, the probability that no other lineage jumps to a particular site is:

$$\left(1 - \frac{1}{N}\right)^{p(T)N} \approx \exp(-p(T)).$$

p(T) is approximately $a/T$, with a measured from simulations of p(T) to be 1.95, and expected analytically to be 2. This gives:

$$P(U > u) = \prod_{T=1}^{u} \exp\left(-\frac{a}{T}\right) = \exp\left(-a \sum_{T=1}^{u} \frac{1}{T}\right) \approx \exp(-a \log(bu)) \sim u^{-a}.$$

Thus the exponent depends on the coefficient a of the scaling of p(T). The probability density is $P(u) = -\frac{d}{du} P(U > u)$, giving:

$$P(u) \sim u^{-(a+1)} \approx u^{-2.95},$$

consistent with both the well-mixed and spatial simulations of P(u).

The scale-free distribution of uniqueness also applies to subgroups defined by a given level of relatedness $T_g$. For each individual, define its subgroup by the identity of its ancestor $T_g$ generations ago. We define the uniqueness $u_g$ of the group to be the uniqueness of the ancestor. The genealogical tree of the ancestors of these groups has the same properties as that of the present population of individuals, only starting with a smaller initial value of L(T). Thus, their uniqueness follows the same power-law distribution.

In Fig. 5, we compare these results to experimental data. The above distribution applies to the whole population. Sampling changes the distribution by making it longer-tailed, corresponding to a greater proportion of individuals which are more unique with respect to the sampled population. We obtained U(T), the number of samples whose genetic distance to their most closely related sample is T, from the same genetic data as in Fig. 3. Fig. 5 compares this distribution with a simulation of U(T) for a sampled spatial population. By adjusting the parameters of the simulation, we simultaneously fit both U(T) and L(T); the simulation is in good agreement with experimental data using the same simulation parameters as in Fig. 3.

In sexual populations, uniqueness is the sum of u(i, g) over all genes g. Figure 9 in Chapter 1 shows the distribution of uniqueness for sexual populations for different numbers of independently inherited units. Summing the uniqueness of these multiple units changes the shape of the distribution, making smaller values of u rarer, and medium values of u more common relative to large ones. However, the long-tailed, power-law character of the distribution remains, and very divergent individuals can still occur in the power-law tail.
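The power-law exponent of the uniqueness distribution in the well-mixed argument can be checked numerically. The Python sketch below assumes p(T) = a/T for every T from 1 to u, which the argument only requires for T not small, and evaluates the product of exp(-a/T) directly. The function names are assumptions of the example.

```python
import math

def survival(u, a=2.0):
    """P(U > u) as the product over T = 1..u of exp(-a / T)."""
    return math.exp(-a * sum(1.0 / t for t in range(1, u + 1)))

def local_exponent(u, a=2.0):
    """Slope of log P(U > u) against log u, measured between u and 2u."""
    return (math.log(survival(2 * u, a)) - math.log(survival(u, a))) / math.log(2.0)

for u in (10, 100, 1000):
    print(u, local_exponent(u, a=1.95))
```

The slope approaches -a as u grows, so the density P(u), being the derivative of the survival function, falls off with exponent -(a + 1). The sum of 1/T is the harmonic number, which behaves as log(u) plus a constant for large u; this is the origin of the log(bu) form in the text.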
For Figure 9 in Chapter 1 we use the simplifying assumption that each unit is independently inherited from a random connected site. Each can actually only be inherited through one of two parents, which correlates the trees with each other, and genes are not inherited independently because of linkage, so the actual change in the shape of the distribution is not as great.

Diversity is often characterized using measures such as Wright's $F_{ST}$ [41], which measures the degree of differentiation between subpopulations relative to the diversity of the whole population. However, the unevenness of the distribution of genetic diversity among individuals is not adequately captured by $F_{ST}$. Divergent individuals and groups do not necessarily correspond to geographically isolated populations, and if a divergent subgroup is not identified as a subpopulation for the purposes of calculating $F_{ST}$, much of its contribution in the form of unique mutations will not be counted because of the averaging within subpopulations. Furthermore, a single subpopulation, if it diverged long enough ago, may contribute a significant fraction of the total diversity of a population, but most of this contribution will not be recognized because of the averaging over subpopulations. For a model with N well-mixed populations arranged spatially (the stepping stone model), $F_{ST}$ has been calculated to grow only weakly (as log(N) [42]).

[Figure 5 plot: $U(T/T_0)$ versus $T/T_0$ on logarithmic axes.]

Figure 5: Distribution of diversity in a population, comparing experimental values of $U(T/T_0)$, the number of samples with a uniqueness of $T/T_0$, for a sampled Pseudomonas population (circles) and average over 1,000 spatial simulations (solid line). We normalize T by dividing by $T_0$, the time to the smallest genetic difference considered. Sampling causes a shallower slope than the distribution for the whole population, and this exponent is matched by the simulation. The simulation and its parameters are as in Fig. 3.

The uneven distribution of diversity in populations causes large fluctuations in total diversity over time (Fig. 10 in Chapter 1). Since individual death is random, the amount of diversity lost due to the death of an individual is distributed as P(u). The levelling off of the curve for small losses is because the several losses that occur in a single time step are averaged. As with the distribution of genetic uniqueness, recombination affects the shape of the distribution, but the long tail remains (Fig. 6).

Figure 6: Distribution of the number of unique mutations lost per generation for different numbers g of heritable units. No linkage is assumed; linkage causes less averaging and hence a smaller reduction in the size of fluctuations. g = 1 corresponds to organelles or asexual populations. (a) The number of loci per heritable unit is kept constant for all g, so the per-genome mutation rate μ increases with g. The per-gene mutation rate is 1. (b) The total size of the genome is kept constant for all g (μ has a constant value of 1), so the per-locus mutation rate μ_l is set to μ/g.

Biological relevance. Although the models used here are simple, the results are robust to changes such as different density or multiple occupancy of a site, overlapping generations, and, for spatial populations, different dispersal distances (as long as they are significantly shorter than the habitat size). These features do not change the scaling of diversity or the nature of its distribution and fluctuations. The results show that genetic diversity is very unevenly distributed in the population. Furthermore, diversity has its own internal dynamics, which are distinct from possible outside influences such as habitat change and species interactions. Increases happen only gradually, but large decreases may occur without a corresponding extrinsic perturbation.

Our results imply that the genetic diversity of a population strongly depends on habitat area, supporting the growing recognition of the dramatic effects that the loss of habitat area has on the diversity of populations. Indeed, while the observed number of species scales as a weak power of the area, so that a reduction in area by a factor of 16 causes a reduction in number of species by a factor of 2 [20], the effect on within-species genetic diversity is much stronger. A reduction in area by a factor of only 2 causes a reduction in diversity by more than a factor of 2, and the effect is even more pronounced in one-dimensional habitats, where the same loss of area gives rise to a loss of 3/4 of the genetic diversity.

From the point of view of conservation efforts, our result that diversity is unevenly distributed within species suggests that the preservation of genetic diversity may often lie in the identification of specific highly unique individuals or groups and ensuring their continued reproduction. The relevance of valuing individuals or groups for conservation by their genetic uniqueness [3,43-45] is underscored by our results. The results also suggest, however, that a highly variable diversity is a characteristic of populations in nature.

Our result on the spatial distribution of diversity (Fig. 11 in Chapter 1) is relevant to explanations of the evolutionary history of species. Explanations for the occurrences of divergent populations in particular areas, and the existence of boundaries between divergent types, are often sought either in habitat variation or in a past migration event. The model shows that highly divergent populations can arise even without specific geographic barriers or adaptation to divergent habitats, particularly when there is no recombination. Though boundaries between particular lineages move and disappear after a finite time, the balance between diversification and extinction leads to the constant presence of strong genetic boundaries which are not due to past population movements or specialization to divergent habitats, but rather to historical accidents. These divergent populations are not necessarily confined to a single area, however; alleles can be geographically widespread in a population with strong geographic structure, and the occurrence of particular alleles in widely separated areas in an asexual or sexual population does not imply that the population is well-mixed. Thus the spatial patterns of genetic variation that arise in homogeneous habitats in the absence of migrations or disturbances must be considered before making inferences about the properties and history of a population, particularly when analyzing mitochondrial DNA, sex-linked chromosomes or sets of genes that are strongly linked.
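The area figures quoted in this discussion can be checked with a few lines of arithmetic. The sketch below assumes pure power laws (quantity proportional to area raised to an exponent) and infers the exponents only from the reduction factors quoted in the text; the function names are illustrative and the power-law form for the one-dimensional case is an assumption made for this example.

```python
import math

def scaling_exponent(area_factor, quantity_factor):
    """Exponent k such that quantity ~ area**k, given one pair of reduction factors."""
    return math.log(quantity_factor) / math.log(area_factor)

def remaining_fraction(area_fraction, exponent):
    """Fraction of the quantity left when only `area_fraction` of the area remains."""
    return area_fraction ** exponent

# Species richness: area reduced 16-fold -> species reduced 2-fold.
z_species = scaling_exponent(16, 2)
# One-dimensional habitat: area halved -> 3/4 of diversity lost (4-fold reduction).
k_one_dim = scaling_exponent(2, 4)

print(f"species-area exponent: {z_species:.2f}")
print(f"1D diversity exponent (assumed power law): {k_one_dim:.2f}")
print(f"species left after halving the area: {remaining_fraction(0.5, z_species):.2f}")
print(f"1D diversity left after halving the area: {remaining_fraction(0.5, k_one_dim):.2f}")
```

Under these assumptions, halving the area removes only a small fraction of the species but three quarters of the genetic diversity in a one-dimensional habitat, which is the contrast the text draws.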

References

[1] Rauch, E. & Bar-Yam, Y. Scaling, dynamics, and distribution of within-species diversity (manuscript).

[2] Ehrlich, P.R. & Wilson, E.O. Biodiversity studies: science and policy. Science 253, 758-761 (1991).

[3] Amos, W. & Balmford, A. When does conservation genetics matter? Heredity 87, 257 (2001).

[4] Frankham, R., Lees, K., Montgomery, M.E., England, P.R., Lowe, E.H. & Briscoe, D.A. Do population size bottlenecks reduce evolutionary potential? Anim. Cons. 2, 255-260 (1999).

[5] Watterson, G. A. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7, 256 (1975).

[6] Hudson, R. R. Gene genealogies and the coalescent process. Oxford Surv. Evol. Biol. 7, 1-44 (1990).

[7] Barton, N.H. & Wilson, I. Genealogies and geography. Phil. Trans. Roy. Soc. B 349, 49-59 (1995).

[8] Epperson, B.K. Recent advances in correlation studies of spatial patterns of genetic variation. Evol. Biol. 27, 95-155 (1993).

[9] Wakeley, J. & Takahashi, T. Mol. Biol. Evol. 20, 208 (2003).

[10] Notohara, M. The structured coalescent process with weak migration. J. Appl. Prob. 38, 1-17 (2001).

[11] Wilkins, J. F. & Wakeley, J. The coalescent in a continuous, finite, linear population. Genetics 161, 873 (2002).

[12] Hoelzer, G. A., Wallman, J. & Melnick, D.J. The effects of social structure, geographic structure, and population size on the evolution of mitochondrial DNA. II. Molecular clocks and the lineage sorting period. J. Mol. Evol. 47, 21-31 (1998).

[13] Kimura, M. & Weiss, G.H. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics 49, 313-326 (1964).

[14] Wright, S. Isolation by distance. Genetics 28, 114-138 (1943).

[15] Fisher, R. A. The distribution of gene ratios for rare mutations. Proceedings of the Royal Society of Edinburgh 50, 205-220 (1930).

[16] Wright, S. Evolution in mendelian populations. Genetics 16, 97-159 (1931).

[17] A more precise recursion relationship for the Wright-Fisher model is p(T + 1) = (1 - eP(T)).

[18] Bramson, M. & Griffeath, D. Asymptotics for some interacting particle systems on Zd. Z fuer Wahr 53, 183-196 (1980).

[19] Cho, J.C. & Tiedje, J.M. Biogeography and degree of endemicity of fluorescent Pseudomonas strains in soil. Appl. Env. Microbiol. 66, 5448-5456 (2000).

[20] Rosenzweig, M.L. Species Diversity in Space and Time (Cambridge Univ. Press, Cambridge, UK, 1995).

[21] Condit, R., Hubbell, S.P., Lafrankie, J.V., Sukumar, R., Manokaran, N., Foster, R.B. & Ashton, P.S. Species-area and species-individual relationships for tropical trees: A comparison of three 50-ha plots. J. Ecol. 84, 549-562 (1996).

[22] Hubbell, S. P. The Unified Neutral Theory of Biodiversity and Biogeography (Princeton University Press, Princeton, 2001).

[23] Avise, J. C. Phylogeography: The History and Formation of Species (Harvard University Press, Cambridge, MA, 2000).

[24] Zerjal, T., Beckman, L., Beckman, G., Mikelsaar, A.V., Krumina, A., Kucinskas, V., Hurles, M.E. & Tyler-Smith, C. Geographical, linguistic, and cultural influences on genetic diversity: Y-chromosomal distribution in Northern European populations. Mol. Biol. Evol. 18, 1077-1087 (2001).

[25] Richards, M., Macaulay, V., Torroni, A. & Bandelt, H.J. In search of geographical patterns in European mitochondrial DNA. Am. J. Hum. Genet. 66, 262-278 (2002).

[26] Bowen, M.D., Rollin, P.E., Ksiazek, T.G., Hustad, H.L., Bausch, D.G., Demby, A.H., Bajani, M.D., Peters, C.J. & Nichol, S.T. Genetic diversity among Lassa virus strains. J. Virol. 74, 6992-7004 (2000).

[27] Schiemann, K., Tyler, T. & Widen, B. Allozyme diversity in relation to geographic distribution and population size in Lathyrus vernus (L.) Bernh. (Fabaceae). Plant Syst. Evol. 225, 119-132 (2000).

[28] Lee, S.W., Choi, W.Y., Kim, W.W. & Kim, Z.S. Genetic variation of Taxus cuspidata Sieb. et Zucc. in Korea. Silvae Genet. 49, 124-130 (2000).

[29] Elderkin, C.L. & Klerks, P.L. Shifts in allele and genotype frequencies in zebra mussels, Dreissena polymorpha, along the latitudinal gradient formed by the Mississippi River. J. N. Amer. Benthological Soc. 20, 595-605 (2001).

[30] Rowe, G., Beebee, T.J.C. & Burke, T. A microsatellite analysis of natterjack toad, Bufo calamita, metapopulations. Oikos 88, 641-651 (2000).

[31] Goossens, B., Chikhi, L., Taberlet, P., Waits, L.P. & Allaine, D. Microsatellite analysis of genetic variation among and within Alpine marmot populations in the French Alps. Mol. Ecol. 10, 41-52 (2001).

[32] Baker, A.M., Mather, P.B. & Hughes, J.M. Evidence for long-distance dispersal in a sedentary passerine, Gymnorhina tibicen (Artamidae). Biol. J. Linnaean Soc. 72, 333-342 (2001).

[33] Clausing, G., Vickers, K. & Kadereit, J.W. Historical biogeography in a linear system: genetic variation of Sea Rocket (Cakile maritima) and Sea Holly (Eryngium maritimum) along European coasts. Mol. Ecol. 9, 1823-1833 (2000).

[34] Jukes, T. H. & Cantor, C. R., in Mammalian Protein Metabolism, M. N. Munro, Ed. (New York: Academic Press, 1969).

[35] Because rivers and coastlines are often fractals, the scaling behavior should be more generally given by B(A) ~ A , where d is the fractal dimension.

[36] Since D is a measure of genetic diversity, it does not include the additional phenotypic diversity of new combinations enabled by recombination, which is important when gene interactions are present.

[37] Nei, M. Molecular Evolutionary Genetics (New York: Columbia University Press, 1987).

[38] Aldous, D. & Fill, J. Reversible Markov Chains and Random Walks on Graphs. Manuscript, http://www.stat.berkeley.edu/users/aldous/book.html.

[39] Kaplan, N.L., Hudson, R.R. & Langley, C.H. The "hitchhiking effect" revisited. Genetics 123, 887-899 (1989).

[40] Majewski, J. & Cohan, F.M. Adapt globally, act locally: the effect of selective sweeps on bacterial sequence diversity. Genetics 152, 1459-1474 (1999).

[41] Wright, S. Evolution and the Genetics of Populations, vol. 1: Genetic and Biometric Foundations (Univ. Chicago Press, Chicago, 1968).

[42] Cox, J.T. & Durrett, R. The stepping stone model: new formulas expose old myths. Ann. Appl. Prob. 12: 1348-1377 (2002).

[43] Faith, D.P. Genetic diversity and taxonomic priorities for conservation. Biol. Cons. 68, 69-74 (1994).

[44] Crozier, R.H. Preserving the information content of species: genetic diversity, phylogeny, and conservation worth. Ann. Rev. Ecol. Syst. 28, 243-268 (1997).

[45] Moritz, C. Defining evolutionarily significant units for conservation. Trends Ecol. Evol. 9, 373-375 (1994).

Chapter 3: Details of comparison with experimental results

Abstract

In this chapter, we describe in detail the comparison of theoretical results with experimental genetic data on global populations of bacteria given in the previous chapter. We address the robustness of the methods used to extract the data, and of our comparison with theoretical results on the shape of the genealogical tree and the distribution of diversity.

The data was obtained from the results of Cho and Tiedje [3] on global samples of Pseudomonas bacteria. They sampled the soil of pristine ecosystems (Mediterranean woody grasslands and Boreal forests) at ten geographic locations on five continents, taking multiple samples at each location along 200 meter transects. From these samples they cultured 248 strains of fluorescent Pseudomonas bacteria. They compared them by genomic fingerprinting, characterizing the genetic distances between them using digitized gel images prepared from repetitive extragenic palindromic PCR with the BOX-A1R primer (BOX-PCR) according to the protocol of Ref. [4].

Cho and Tiedje constructed a hierarchical clustering dendrogram from effective distances between the digitized gel images. From their dendrogram, we obtained counts of the number of ancestors at a particular effective genomic similarity (r-value) as measured by this fingerprinting technique to obtain L(r), corresponding to the number of ancestors that existed at a time such that their living descendants have diverged to a similarity value of r. L(r) was sampled at intervals of 0.05. For r-values less than 0.45, the sampling resolution was doubled because of the smaller intervals between coalescences in this range.

In order to obtain L(T), the r-values were first mapped onto DNA-DNA homology values h using a regression curve $h = \sqrt{1.046 - (1.15 - r)^2}$ obtained from experimental results [5], as shown in Figure 2. These results are from Xanthomonas, a genetically similar bacterial genus, whose r-values were obtained using the same protocol as the Pseudomonas results, and whose h-values were obtained using AFLP.

There is substantial noise in this data; however, this noise does not affect the comparison of the Pseudomonas data with theoretical results. Because each value of L(T) is determined by measurements of multiple isolates, the impact of noise is reduced. Moreover, both noise in the values of r and h and variation in the transformation between them have little impact on the results described in the paper, since the comparison is with the robust scaling behavior. To verify this, noise was added to the r-values of the original data with the same variance as the data in Figure 2. Figure 3 shows that the resulting L(T) values are not substantially affected. We note that the only significant sensitivity to noise is found in the region just above T/T_0 = 1000, where the original points are least in agreement with theory and the size of the error is approximately the difference between data values and the theoretical curve, suggesting that even this disagreement may be due to noise. The effect of systematic changes in the shape of the fitted r-to-h curve was also tested and found insignificant.

The h-values were then mapped onto values proportional to T using the Tajima form [6] of the Jukes-Cantor correction. In the figures, this is normalized by dividing by T_0, the time to the smallest genetic difference considered (r = 0.95) [7]. The final results are shown as circles in Figs. 3 and 5 in Chapter 2.
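As a minimal sketch of the first step of this pipeline, the code below builds a sampling grid of r-values and applies the regression curve quoted in the text. The grid end points (0.20 and 0.95) and the function names are assumptions made for illustration; only the step sizes, the 0.45 threshold and the regression formula come from the text. The subsequent Jukes-Cantor step is not reproduced here.

```python
import math

def r_to_homology(r):
    """Regression curve quoted in the text: h = sqrt(1.046 - (1.15 - r)^2)."""
    return math.sqrt(1.046 - (1.15 - r) ** 2)

def sampling_grid(r_min_milli=200, r_max_milli=950, split_milli=450):
    """r-values at which L(r) is read off the dendrogram.

    Step 0.05 at and above r = 0.45, step 0.025 below it (doubled resolution).
    Values are handled in thousandths to avoid floating-point drift.
    The end points are hypothetical; the text does not state them.
    """
    fine = range(r_min_milli, split_milli, 25)
    coarse = range(split_milli, r_max_milli + 1, 50)
    return [v / 1000 for v in sorted(set(fine) | set(coarse))]

grid = sampling_grid()
homology = [r_to_homology(r) for r in grid]
for r, h in zip(grid, homology):
    print(f"r = {r:.3f}  ->  h = {h:.3f}")
```

On this grid the curve is monotonically increasing in r, so the ordering of coalescence events is preserved by the mapping.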

Figure 1. Dendrogram of Pseudomonas bacteria showing values at which L(r) was sampled as vertical lines. The horizontal axis is the r-value (% similarity).

Figure 2. Regression curve (red line) fit to experimental data on DNA-DNA homology values as a function of BOX-PCR r-values (adapted from Rademaker et al. [5]). The horizontal axis is the BOX-PCR r-value (x 100). The large amount of noise in the data has little impact on the comparison of experimental data with theory, as discussed in the supporting online text and Figure 3.

Figure 3. Sensitivity of L(T) to noise, plotted as L(T/T_0) against T/T_0 on logarithmic axes. The figure shows the data extracted from the original experimental values (circles) and the data extracted after the addition of noise (plusses). Gaussian noise was added to the r-values with the standard deviation ($\sigma = 0.12$) of the data in Figure 2. Since the r-value of a coalescence between two subtrees is an aggregate of individual data points, the error in the r-value of a particular coalescence depends on the number of lineages that are coalescing, with less noise when there are more samples. For an aggregate of m lineages, the error in an aggregate coordinate is reduced by the factor $\sqrt{m}$, and the error in the relative coordinate (r-value) of two such aggregates is reduced by the square root of the geometric mean of their numbers of lineages (i.e. this is the error of the effective distance between them as a relative coordinate). Thus, Gaussian noise was added to the r-value of each coalescence between two branches that have $m_1$ and $m_2$ lineages respectively with standard deviation $\sigma_0 \left(\frac{1}{m_1} + \frac{1}{m_2}\right)^{1/2}$, where $\sigma_0 = \sigma/\sqrt{2}$, so that the noise for the r-value between two lineages is $\sigma$.
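The noise model in the caption can be sketched as follows. The expression for the standard deviation is the quadrature form sigma_0 * sqrt(1/m1 + 1/m2), which is an assumed reading chosen because it gives sigma for two single lineages when sigma_0 = sigma / sqrt(2); the function names are hypothetical.

```python
import math
import random

def coalescence_noise_sd(m1, m2, sigma=0.12):
    """Standard deviation of the noise added to the r-value of a coalescence
    between branches holding m1 and m2 lineages (assumed quadrature form)."""
    sigma0 = sigma / math.sqrt(2)
    return sigma0 * math.sqrt(1.0 / m1 + 1.0 / m2)

def perturb_r(r, m1, m2, rng, sigma=0.12):
    """Return the r-value of a coalescence with Gaussian noise added."""
    return r + rng.gauss(0.0, coalescence_noise_sd(m1, m2, sigma))

rng = random.Random(0)
for m1, m2 in [(1, 1), (4, 4), (1, 50)]:
    sd = coalescence_noise_sd(m1, m2)
    print(f"m1={m1:2d} m2={m2:2d}  sd={sd:.4f}  noisy r={perturb_r(0.5, m1, m2, rng):.3f}")
```

Coalescences deep in the tree join large subtrees, so under this form their r-values are perturbed much less than those of single isolates.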

Simulation of sampled population

Here we describe specifics of the simulation of the sampled population (solid line in Figure 3 in Chapter 2, and dashed line in Figure 5 in Chapter 2) and the fitting of parameters. The lineages of the sampled population were simulated directly as random walkers, with simulated time corresponding to time in the past. Any walkers that moved onto the same site coalesced with probability p_c = 0.15. Initial random walkers were placed, one per isolate, on a 50 x 25 lattice on sites whose coordinates were proportional to the latitude and longitude of samples of the Cho & Tiedje data (Table 1). In order to produce a curve that is smooth at long times, the average of 100 runs was taken. Error bars for the simulation are shown in Fig. 4.

Latitude   Longitude   y-coordinate   x-coordinate   num. of lineages
 -34.00      116.25          7             41               52
 -32.38      116.12          8             41               53
  40.17     -122.48         18              7               39
  40.48     -122.48         18              7               39
 -33.85      -70.48          7             15                8
 -32.95      -71.08          7             15                5
 -33.07       18.67          7             27               20
  53.68     -105.37         19             10               24
  54.00     -106.58         20             10                5
  52.65     -102.38         19             10                5

Table 1. Coordinates of isolates in Cho & Tiedje data.
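A minimal sketch of such a simulation is given below, using the rows of Table 1 as starting positions. Several details are assumptions made for illustration rather than the procedure actually used: the floor-based mapping from latitude and longitude to lattice sites (it happens to reproduce every row of Table 1), nearest-neighbor moves, wrapping in the x direction and clamping in the y direction, and the rule that each additional walker on an occupied site merges with probability p_c.

```python
import random

WIDTH, HEIGHT = 50, 25   # lattice size from the text
P_COALESCE = 0.15        # p_c from the text
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

# (latitude, longitude, y, x, number of lineages), copied from Table 1.
TABLE_1 = [
    (-34.00, 116.25, 7, 41, 52), (-32.38, 116.12, 8, 41, 53),
    (40.17, -122.48, 18, 7, 39), (40.48, -122.48, 18, 7, 39),
    (-33.85, -70.48, 7, 15, 8), (-32.95, -71.08, 7, 15, 5),
    (-33.07, 18.67, 7, 27, 20), (53.68, -105.37, 19, 10, 24),
    (54.00, -106.58, 20, 10, 5), (52.65, -102.38, 19, 10, 5),
]

def lattice_site(lat, lon):
    """Assumed proportional mapping of (lat, lon) to an (x, y) lattice site."""
    x = int((lon + 180.0) * WIDTH / 360.0)
    y = int((lat + 90.0) * HEIGHT / 180.0)
    return x, y

def simulate_lineages(n_steps, rng):
    """Return the number of surviving lineages after each backward time step."""
    walkers = [(x, y) for _, _, y, x, n in TABLE_1 for _ in range(n)]
    counts = [len(walkers)]
    for _ in range(n_steps):
        occupancy = {}
        for x, y in walkers:
            dx, dy = rng.choice(MOVES)
            site = ((x + dx) % WIDTH, min(max(y + dy, 0), HEIGHT - 1))
            occupancy[site] = occupancy.get(site, 0) + 1
        walkers = []
        for site, k in occupancy.items():
            # Each extra walker on the site survives with probability 1 - p_c.
            survivors = 1 + sum(rng.random() >= P_COALESCE for _ in range(k - 1))
            walkers.extend([site] * survivors)
        counts.append(len(walkers))
    return counts

def average_lineages(n_runs, n_steps, seed=0):
    """Average the lineage counts over independent runs to smooth the curve."""
    rng = random.Random(seed)
    runs = [simulate_lineages(n_steps, rng) for _ in range(n_runs)]
    return [sum(step_counts) / n_runs for step_counts in zip(*runs)]

curve = average_lineages(n_runs=5, n_steps=200)
print(curve[0], curve[-1])
```

Because simulated time runs backwards, the returned list is the analogue of L(T): it starts at the number of isolates and can only decrease as lineages coalesce.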

The theoretical simulations included a fitting of two parameters. These parameters do not set the key features of the comparison - the slopes of the long and short time behaviors - but rather only set the intercepts (offsets), since the shape of the curve is highly constrained by the theory. The fitting of the two parameters, shown in Figure 5, is determined by the two intercepts. The slope of the long-time portion of the curve (the scaling regime) is determined by the scaling of the number of lineages, which has no adjustable parameters.

Figure 4. L(T) for spatial simulation of the sampled population, plotted as L(T/T_0) against T/T_0 on logarithmic axes, showing standard deviation of 100 runs as error bars. Experimental data is shown as circles.

References

[1] Rauch, E.M. & Bar-Yam, Y. Diversity is unevenly distributed within species (manuscript).

[2] Rauch, E.M. & Bar-Yam, Y. Characterizing the amount of genetic diversity in a population by simulating its genealogy (manuscript).

[3] J. C. Cho, J. M. Tiedje, Appl. Env. Microbiol. 66, 5448 (2000).

[4] J. L. W. Rademaker, F. J. Louws, F. J. de Bruijn, in A. D. L. Akkermans, J. D. van Elsas, F. J. de Bruijn (eds.), Molecular Microbial Ecology Manual, suppl. 3 (Dordrecht: Kluwer Academic Publishers, 1998).

[5] J. L. W. Rademaker et al., Int. J. Systematic Evol. Microbiol. 50, 665 (2000).

[6] F. Tajima, Mol. Biol. Evol. 10, 677 (1993).

[7] Cho & Tiedje found r-values of individual samples to be reliable only for differences of 0.1; however, the average implied in the use of many such measurements to determine L(r) allows a finer grid down to 0.05.

Figure 5. Effect of varying the two parameters of the model. Both panels plot L(T/T_0) against T/T_0 on logarithmic axes. (a) The coalescence probability p_c (legible legend entries: 0.07 and 0.15). (b) The number of simulation time steps N_T corresponding to one unit T_0 of biological time (legend entries: 80, 160 and 320).

Chapter 4: Dynamics and genealogy of strains in spatially extended host-pathogen models

Abstract

In the previous chapters, we have used the method of tracing the genealogical tree to study how diversity evolves within populations. In the next two chapters, we will use this method to understand the dynamics of evolution in a population that differs in that it interacts ecologically with another population. Using a generic spatial model of a pathogen infecting a population of hosts (which can also be considered a predator-prey system), we examine the spatial and temporal dynamics of strains (pathogens descended from a common mutant ancestor). Instead of diversity, we will study the dynamics of evolution of a trait. In the model, the transmissibility of the pathogen can evolve by mutation. Strains of intermediate transmissibility dominate even though high-transmissibility mutants have a short-term reproductive advantage. Mutant strains continually arise and grow rapidly for many generations but eventually go extinct before dominating the system. We find that, after a number of generations, the mutant pathogen characteristics strongly impact the spatial distribution of their local host environment, even when there are diverse types coexisting. Extinction is due to the depletion of susceptibles in the local environment of these mutant strains. Studies of spatial and genealogical relatedness reveal the self-organized spatial clustering of strains that enables their impact on the local environment. The method of genealogical tracing enables us to show that selection acts against the high-transmissibility strains on long time scales as a result of the feedback due to environmental change, and provides a more general measure of evolutionary fitness that reflects the importance of time scale in evolutionary processes. Our study shows that averages over space or time should not be assumed to adequately describe the evolutionary dynamics of spatially-distributed host-pathogen systems.

Background

Evolution in host-pathogen systems is a topic of great interest because pathogen generation times are short, and hence adaptation can occur rapidly [2,3]. There are many medically and ecologically relevant examples of pathogen evolution, such as the emergence of drug-resistant strains [4] and the decreased virulence of introduced control agents [5]. Host-pathogen systems are typically not well-mixed, but rather are spatially distributed. Mutant pathogen strains arise locally, and considerable variation in type is possible from one locality to another [6]. Moreover, host and pathogen densities are inhomogeneous and dynamic. It has become apparent in recent research that inhomogeneities in spatially distributed populations can fundamentally change the dynamics of ecological systems [7,8], and host-pathogen systems are no exception [9,10]. Spatial extent can also fundamentally change evolutionary dynamics [11]. We note, in particular, that the characteristics of the pathogen can greatly affect the spatial and temporal dynamics of the host, which in turn affects the evolution of the pathogens. Host-pathogen systems can be considered as a type of predator-prey system, and these characteristics and our investigation also apply to predator-prey systems.

Most pathogenic species are found to consist of a number of distinct strains, distinguished from one another by a mutation or set of mutations. Our study of strain dynamics will allow us to discuss the relationship of local spatial effects and the long-term behavior of the system. In particular, we will show that mutant strains can arise that (1) increase in number rapidly over many generations; (2) are spatially clustered; (3) become extinct over longer times due to the local extinction of hosts. The evolutionary dynamics can be understood as a selection process that favors different types at different time scales. There are two regimes, with a sharp transition between them: a short time regime in which mutant strains with high reproduction ratios dominate, and a long time regime in which environmental feedback causes these strains to be selected against.

The model

We consider a simple spatially-extended model of a pathogen spreading through a host population. This model allows for the mutation of a single quantitative trait, the transmissibility of the pathogen from one host to another. Similar to other recent studies [12-14], mutation is part of the dynamics of the model. The evolving population is composed of different types of pathogens. We will study the mechanisms that give rise to this composition.

The class of models we consider assumes that reproduction of hosts and infection by pathogens occur locally in space (e.g. by contact or airborne transmission rather than waterborne transmission). We also assume that infection is ultimately fatal, so that our models are relevant to the case where infection is at least usually fatal and not to the case where infected hosts normally recover with or without immunity. As a model of predator-prey systems, it is relevant to the case where predators are capable of causing the local extinction of their prey, and when a local population is "infected" with predators it cannot recover.

Our main results are insensitive to the detailed aspects of the model, including not only parameter values but also the inclusion of additional factors such as uninfected host death, limited local movement of hosts, occasional long-range dispersal of hosts, and different lattice structures. Such changes affect specific values of measured quantities, but not the generic behavior of the model. Although it could be considered a model of specific systems, our investigation is one of generic properties common to many spatially distributed systems.

The model without mutation

The model we use [15-17] is a probabilistic cellular automaton, with possible states 0 (empty), S (susceptible host), and I (infected host). At each time step, healthy hosts reproduce into each empty neighboring cell with probability g; this occurs independently for each neighboring cell. To model the carrying capacity of the environment, each cell can have at most one host individual. Alternatively, each cell can be considered to represent local populations, either absent or at carrying capacity. An infected host dies with probability v (virulence). Finally, an infected host I causes a neighboring uninfected host to become infected with probability r (transmissibility). The state transition probabilities are:

$$P(0 \to S) = 1 - (1 - g)^n$$

$$P(S \to I) = 1 - (1 - r)^m \qquad (1)$$

$$P(I \to 0) = v$$

where n is the number of uninfected host neighbors, and m is the number of infected neighbors. Rand et al. [16] note that asynchronous updating does not significantly change the dynamics. The model differs from that in Ref. [13] only in the use of discrete time and the lack of death of susceptibles.

Figs. 1 and 2 show snapshots of simulations after the long-term behavior is established, revealing how the geometry changes with differing transmissibility, virulence and reproduction rate. The system is spatially inhomogeneous, with host and pathogen distributed patchily, and spatial correlations in the distribution and reproduction of host and pathogen. As in all host-pathogen models, the pathogen must have a minimum transmissibility in order to propagate. In this model, the pathogen can drive the host to extinction if it exceeds a certain transmissibility [15]. Thus, there is a minimum and maximum transmissibility at which the pathogen and host can coexist. The region of parameter space in which there is coexistence has been obtained [13].

Figure 1: Snapshots of the host-pathogen model with no mutation. Each of the 25 blocks is from a simulation with distinct parameter values. Transmissibility r (columns) and virulence v (rows) each take the values 0.2, 0.4, 0.6, 0.8 and 1.0, with host reproduction rate g held at 0.05. Green represents healthy hosts, red represents infected hosts, and black represents empty sites. The snapshots for those parameters for which hosts, but not pathogens, survive after 100 generations appear completely green. For those that appear black, the outcome is uncertain and can be either pathogen extinction or extinction of both pathogen and host. We use an L x L square lattice with periodic boundary conditions and a von Neumann neighborhood (north, south, east and west neighbors); here L = 80.

Rand et al. compared simulations of the spatially-extended model with a mean field (well-mixed) version of the model. In both cases, they considered the dynamics of a system with pathogens of transmissibility r, and introduced pathogens with transmissibility r ± Δr (Δr = 0.01). In the well-mixed version, the higher-r population always invades, driving the lower-r one to extinction but itself surviving. By contrast, in the spatially-extended version, there is a value of r above which the mutant population does not successfully invade.
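The update rule of the model without mutation can be sketched as a synchronous step on a periodic square lattice with a von Neumann neighborhood, as described in the text. The state encoding, function names and the list-of-lists grid are choices made for this illustration, not taken from the original simulation code.

```python
import random

EMPTY, SUSCEPTIBLE, INFECTED = 0, 1, 2

def neighbors(x, y, L):
    """Von Neumann neighborhood with periodic boundary conditions."""
    return [((x + 1) % L, y), ((x - 1) % L, y), (x, (y + 1) % L), (x, (y - 1) % L)]

def step(grid, g, r, v, rng):
    """One synchronous update; returns a new grid and leaves `grid` unchanged."""
    L = len(grid)
    new = [row[:] for row in grid]
    for x in range(L):
        for y in range(L):
            nbr_states = [grid[i][j] for i, j in neighbors(x, y, L)]
            state = grid[x][y]
            if state == EMPTY:
                n = nbr_states.count(SUSCEPTIBLE)
                # P(0 -> S) = 1 - (1 - g)^n
                if rng.random() < 1 - (1 - g) ** n:
                    new[x][y] = SUSCEPTIBLE
            elif state == SUSCEPTIBLE:
                m = nbr_states.count(INFECTED)
                # P(S -> I) = 1 - (1 - r)^m
                if rng.random() < 1 - (1 - r) ** m:
                    new[x][y] = INFECTED
            elif state == INFECTED and rng.random() < v:
                # P(I -> 0) = v
                new[x][y] = EMPTY
    return new

rng = random.Random(0)
L = 20
grid = [[SUSCEPTIBLE] * L for _ in range(L)]
grid[L // 2][L // 2] = INFECTED
for _ in range(50):
    grid = step(grid, g=0.05, r=0.3, v=0.2, rng=rng)
print(sum(row.count(INFECTED) for row in grid), "infected sites after 50 steps")
```

All transitions are evaluated against the old grid, so the result does not depend on the order in which sites are visited.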

Figure 2: As in Fig. 1, but with transmissibility r (columns) and host reproduction rate g (rows) varied, with virulence v held at 0.5.

The model with dynamic mutation

In real systems, characteristics of the pathogen can mutate, and this must be considered when making statements about the long-term behavior of a host-pathogen system. In order to investigate the evolutionary dynamics of the system, mutation should be incorporated into the dynamics of the model [12, 13]. The transmissibility becomes a variable quantitative trait which is part of an infected individual's state, rather than a global parameter. The states become 0, S, and I_r (host infected with pathogen of transmissibility r). Mutation can be introduced by assuming that there is a probability μ that when a pathogen of transmissibility r spreads, the newly infected individual has transmissibility r ± ε:

$$P(0 \to S) = 1 - (1 - g)^n$$

$$P(S \to I_r) = \left[1 - \prod_{r'} (1 - r')^{m_{r'}}\right] \frac{\frac{\mu}{2} p_{r-\epsilon} + \frac{\mu}{2} p_{r+\epsilon} + (1 - \mu)\, p_r}{\sum_{r'} \left(\frac{\mu}{2} p_{r'-\epsilon} + \frac{\mu}{2} p_{r'+\epsilon} + (1 - \mu)\, p_{r'}\right)} \qquad (2)$$

$$P(I_r \to 0) = v$$

where m_r is the number of infected neighbors of transmissibility r. The assumption of incremental mutation will be extended later.
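One simple way to realize incremental mutation in a simulation is sketched below. It is not the closed-form rule of Eq. (2): here each infected neighbor independently attempts transmission with its own transmissibility, one successful neighbor is picked uniformly at random, and the inherited value then shifts by ±eps with probability mu. The tie-breaking rule, the clamping of the result to [0, 1] and all names are assumptions made for illustration.

```python
import random

def infect(neighbor_r, mu, eps, rng):
    """Return the transmissibility acquired by a susceptible host, or None.

    neighbor_r: transmissibilities of the infected neighbors.
    The host stays healthy with probability prod(1 - r) over the neighbors.
    """
    successes = [r for r in neighbor_r if rng.random() < r]
    if not successes:
        return None
    r_new = rng.choice(successes)          # assumption: uniform choice among successes
    if rng.random() < mu:
        r_new += rng.choice((-eps, eps))   # incremental mutation, either direction
    return min(max(r_new, 0.0), 1.0)       # assumption: keep r a valid probability

rng = random.Random(0)
outcomes = [infect([0.3, 0.3, 0.35], mu=0.15, eps=0.005, rng=rng) for _ in range(10)]
print(outcomes)
```

The overall infection probability in this scheme is 1 minus the product of (1 - r) over the infected neighbors, the same prefactor as in the transition rule for the multi-type model.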

Behavior of the model with mutation

Mutation causes pathogens of differing transmissibility to coexist on the lattice. However, the system always reaches an evolutionarily stable value of r [13], and this value is somewhat lower than the maximum value for which pathogen and host can coexist. We show in Figure 3 an evolving system at intervals of 20 generations after it has reached the evolutionarily stable average transmissibility, showing patches of prey growing and being depleted by predators of various types. Fig. 4 shows snapshots of simulations with dynamic mutation for different combinations of parameters. Each snapshot is taken after 10,000 generations, a time long enough to allow the evolved transmissibility to reach a stable value, aside from fluctuations. We find that in the presence of mutation, host and pathogen coexist for a wide range of virulence and host reproduction rate. The evolutionarily stable value is substantially below the maximum possible value. Fig. 5 shows two typical time series of the average, minimum and maximum transmissibility of the population. In each case, the average transmissibility is seen to approach an evolutionarily stable value after several thousand generations, and then stay within 5% of this value. (For some combinations of parameters, the average varies over time by as much as 17%, but it varies by no more than 5% for most.) We show in Figs. 5a and 5b that it reaches the same value whether the system starts with pathogens with transmissibility above or below this value.

Dynamics of Strains

An important clue to the evolutionary dynamics of the system can be seen in fig. 6, a density plot of the distribution of pathogen transmissibilities over time. Most pathogens are within 0.05 of the evolutionarily stable value of r = 0.3. However, there is an additional temporal structure that is apparent in the figure: the population appears to have offshoots that persist for tens to hundreds of generations before disappearing. These offshoots are part of the characteristic behavior of the evolving population, even after it has converged to the evolutionarily stable average transmissibility. In the plot, an example of such an offshoot occurs at time T = 26000. The offshoots are visual traces of genetically related pathogens - strains. In particular, they reflect the presence of mutant strains which substantially exceed the evolutionarily stable value of r, but then go extinct.

Figure 3: Snapshots of the lattice for the evolving pathogen-host model. 20 generations have elapsed between each frame. Susceptible hosts are shown as green and infected hosts are colored depending on their value of r, as shown in the legend. In this and all subsequent figures, the system has settled to a stable value of r. The lattice size L = 100. The depletion rate v = 0.2 and prey reproduction rate g = 0.05.

In the remainder of this paper, we will analyze the evolutionary dynamics by examining features related to properties of strains. We will examine: (1) the reproductive success of mutant pathogens and their descendants; (2) the lineage histories of strains; (3) the effect of pathogen phenotype on the local environment of susceptibles; and (4) the relationship of spatial and genealogical structure. Our analysis elucidates the mechanisms by which the population comes to be dominated by strains of intermediate transmissibility. We believe similar mechanisms may be at work in many natural systems.

In order to distinguish the identity of strains, we track the genealogy of pathogens. A strain is the set of individuals descended from a single common ancestor. One can choose any ancestor, but when studying evolutionary dynamics, it is particularly useful to consider a mutant strain to begin when a mutation occurs. Mutant descendants of this first mutant can be considered the beginning of a new strain. To obtain adequate sampling of high-transmissibility cases, which are rare under incremental mutation because of selection, we modify the evolutionary model given in equation 2 to have large, uniformly distributed mutations: mutations to a random value of transmissibility between 0.2 and 1.0, with a mutation rate μ = 0.002. This rate is low enough that it is rare that a mutant strain itself mutates again. This modification does not significantly change the evolutionarily stable transmissibility.
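The genealogical bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the simulations: the `Pathogen` record, the `infect` function and the strain counter are hypothetical names introduced here, and only the mutation rule (a uniform draw between 0.2 and 1.0 at rate 0.002) is taken from the text.

```python
import random
from dataclasses import dataclass
from itertools import count
from typing import Optional

MUTATION_RATE = 0.002      # probability that an offspring is a mutant (from the text)
R_MIN, R_MAX = 0.2, 1.0    # range of the uniform mutation (from the text)

_strain_ids = count(1)     # hypothetical counter handing out fresh strain labels


@dataclass
class Pathogen:
    r: float                       # transmissibility
    strain: int                    # label of the strain this pathogen belongs to
    parent: Optional["Pathogen"]   # genealogical link (None for a founder)
    born: int                      # generation in which this infection began


def infect(parent: Pathogen, generation: int, rng) -> Pathogen:
    """Return the offspring created when `parent` infects a susceptible host."""
    if rng.random() < MUTATION_RATE:
        # A mutation starts a new strain with a fresh label and a random type.
        return Pathogen(rng.uniform(R_MIN, R_MAX), next(_strain_ids), parent, generation)
    # Otherwise the offspring inherits both the type and the strain label.
    return Pathogen(parent.r, parent.strain, parent, generation)
```

Counting the living pathogens that carry a given strain label at each generation, and averaging over many strains of the same type, would give the strain population as a function of time since the strain began.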

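[Figure 4 image: a 5 x 5 grid of lattice snapshots, with g = 0.2, 0.4, 0.6, 0.8, 1.0 along one axis and v = 0.2, 0.4, 0.6, 0.8, 1.0 along the other.]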

Figure 4: Snapshots of the host-pathogen model with mutation after 10,000 generations. The transmissibility has evolved to an evolutionarily stable value. Each of the 25 blocks represents a simulation with different values of g and v as indicated. The dimension of the lattice L is 175, the mutation rate μ is 0.15 and the mutation increment e is 0.005.

[Figure 5 image: two panels, (a) and (b), plotting transmissibility r (vertical axis, 0.1 to 0.5) against time T in generations (horizontal axis, 0 to 50,000).]

Figure 5: Time series of transmissibility r in the population, showing average, maximum, and minimum values. (a) r is started at 0.15, below the evolutionarily stable value of 0.3. r evolves upward to reach the evolutionarily stable value within 7,000 generations. (b) r is started at 0.49; r evolves downward to the evolutionarily stable value, again within 7,000 generations. The virulence v is 0.2, host reproduction rate g is 0.05, lattice size L is 250, mutation rate μ is 0.15, and mutation increment e is 0.005. All of the following figures use these parameters unless otherwise noted.

[Figure 6 image: density plot of transmissibility r (vertical axis, 0.15 to 0.5) against time T in generations (horizontal axis, 0 to 30,000).]

Figure 6: Time series of the distribution of r. Each vertical slice of this three-dimensional plot shows the distribution of transmissibilities at a given moment in time. Note that strains temporarily exceed the evolutionarily stable value of 0.3 but then go extinct. These correspond to the long excursions, for example at T = 26000. Parameters are as in fig. 5.

Reproductive success of mutants and their descendants

To gain insight into the reproductive success of mutants, we examine the net reproduction ratio R, which is defined as the average number of other individuals infected during the course of an individual's infection in a population where the infection is present. R has been commonly used as a measure of fitness in theoretical evolutionary studies of host-pathogen and many other ecological systems [18, 19]. In spatially homogeneous treatments of host-pathogen systems, selection will tend to increase R, and R increases with transmissibility [20]. The reason is that if two pathogens have the same number of susceptible neighbors (a condition that applies to the homogeneous version of the model), the one with the higher transmissibility has a greater probability of infecting. This is indeed the case for the reproduction ratio of initial mutants in the spatially distributed host-pathogen model; fig. 7a plots the expected number of offspring of a mutant one generation after it arises. R increases rapidly and roughly linearly with r, from a value of one at r = 0.3 to a value of 1.4 at r = 1. This must occur due to the equivalence of the environments into which all the mutants are introduced on average.

On the other hand, this conflicts with the observation that, in the model, pathogens with intermediate transmissibility dominate. The reproduction ratio averaged over many generations, in fact, has a peak at the evolutionarily stable transmissibility. Fig. 7b shows R averaged over many generations. For values of r centered on the evolutionarily stable value of 0.3, R is slightly greater than one. (A self-sustaining stable population without mutation would have a reproduction ratio of exactly one; here the reproduction ratio is greater than one because of mutated offspring which have lower reproduction ratios.) R is significantly lower for both higher and lower values of the transmissibility, consistent with the observation of the evolutionarily stable type.
Thus, selection does not act instantaneously to favor pathogens of intermediate transmissibility. The difference between the time-averaged and mutant reproduction ratios points out the need to consider the reproductive success of pathogens over the lineage history of a strain.

Lineage history of strains

Figs. 8 and 9 plot a measurement of the average population size F(T, r) of a strain as a function of transmissibility r and number of generations T since the beginning of the strain. Strains with higher r grow much faster than ones with lower r for a large number of generations. They reach a maximum and start declining after about 30 generations and eventually go extinct. This is consistent with both Figs. 7a and b. Because

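[Figure 7 image: two panels plotting the net reproduction ratio R (vertical axis) against transmissibility r (horizontal axis). In panel (a) R ranges up to about 1.4; in panel (b) R ranges from 0.9985 to 1.0005.]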

Figure 7: The net reproduction ratio R in an evolving population, as a function of transmissibility r. The dominant type has reached its evolutionarily stable value of r = 0.3. (a) For mutants, showing the expected number of offspring of a mutant one generation after it arises. (b) For all pathogens, averaged over 3 x 10^7 generations.

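[Figure 8 image: average strain population (vertical axis) against time T in generations (horizontal axis, 0 to 500), with one curve per transmissibility, including r = 0.9, r = 0.3 and r = 0.2.]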

Figure 8: The average population of mutant strains as a function of time since the first (ancestor) mutant arose. The average population is plotted as a function of time T, with curves for various transmissibilities r. High-transmissibility strains initially grow rapidly, but reach a maximum and then decline after about 30 generations. In order to collect data for all r, mutations are large: mutants' transmissibility is set to a random value between 0.2 and 1. μ is 0.002; other parameters are as in fig. 5.

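[Figure 9 image: average strain population (vertical axis) against transmissibility r (horizontal axis, 0.2 to 1.0), with one curve per time T, including T = 38 and T = 100.]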

Figure 9: As in Fig. 8, but the average population is plotted as a function of r, with curves for various T. Between T = 100 and T = 200, it can be seen that selection changes from favoring higher-transmissibility mutant strains to favoring strains of intermediate transmissibility.

[Figure 10 image: two panels plotting R(T, r) (vertical axis) against T in generations (horizontal axis); the right panel shows the same data with a truncated R-axis.]

Figure 10: The normalized per-generation reproduction ratio of a mutant strain $R(T, r) = N(T, r)^{1/T}$. Within 200 generations, this value drops below 1 for transmissibilities significantly greater than the evolutionarily stable type. The plot on the left shows R(T, r) for the full range of R. To show that R(T, r) < 1 when r is not the evolutionarily stable value, the right plot shows the same data with a truncated R-axis. Parameters are as in figure 8.

F(T, r) is the expected number of descendants of a mutant after T generations, it quantifies the reproductive success of strains. It therefore can be considered a measure of evolutionary fitness that generalizes the reproduction ratio R. In order to make a more explicit comparison with the reproduction ratio R, one can calculate the normalized reproduction ratio of the mutant strain over the course of its lineage history as a function of time, $R(T, p) = F(T, p)^{1/T}$, representing the average number of offspring per generation from the beginning of the strain to time T. In Fig. 10, this measure of reproductive success can be seen to decrease below one for types significantly greater than the evolutionarily stable value at around T = 200 generations. The high-r strains can grow in the short term, but die out in the long term, despite the fact that they have a higher net reproduction ratio for a significant time after they are first introduced.

Considering the fitness of strains as a function of time allows one to characterize evolutionary systems in which the reproductive success of mutants differs on different time scales. The populations of such systems can contain a mixture of strains, each of which is successful on a different time scale. For a given time scale T, the most successful type for that time scale, $p_{opt}(T)$, is the value of p such that R(T, p) is maximized: $p_{opt}(T) = \arg\max_p R(T, p)$. Figure 11a shows that, for the model, one type dominates for short time scales, and another dominates for long time scales, with a sharp transition between the two scales. Since selection acts differently on a given type at different time scales, one can determine the relevant time scales for a particular type. For all types $p \neq p_{es}$ we can define the time scale $T_s(p)$ at which selection acts against p as:

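[Figure 11 image: (a) the most successful type against time T in generations (0 to 1000); (b) the time scale of selection (0 to 600 generations) against transmissibility (0.2 to 1.0).]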

Figure 11: Time scale of selection. (a) The most successful type $p_{opt}(T)$ as a function of time since the beginning of the strain. Types of high transmissibility (those with high values of R in figure 7a) dominate for time scales shorter than about T = 175, while types close to the evolutionarily stable type (those with high values of time-averaged R in figure 7b) dominate on time scales longer than T = 250. (b) The time scale $T_s(r)$ at which selection acts against strains of pathogens with transmissibility r. $T_s(r)$ is 0 for $r < r_{es}$, indicating that selection acts instantaneously. For $r > r_{es}$ the time scale is very long for values close to $r_{es}$, converging on $T_s \approx 180$ for high r. Parameters are as in figure 8.

$$T_s(p) = \min \{\, T : F(t, p) < F(t, p_{es}) \text{ for all } t > T \,\}$$

Thus for some $T < T_s(p)$, mutants of phenotype p have more descendants than those of $p_{es}$. The time scale at which the evolutionarily stable type begins to dominate is given by $T_{es} = \min \{\, T : p_{opt}(T) = p_{es} \,\}$. For the host-pathogen system, figure 11b shows $T_s(r)$ (here p = r). For $r < r_{es}$, $T_s(r) = 0$ since these low-transmissibility types have a disadvantage on all time scales. For $r > r_{es}$, $T_s(r)$ approaches a constant number of generations (about 200 for the parameters used in figure 11b) but is larger when r is close to $r_{es}$. Thus, for $r > r_{es}$, on time scales significantly shorter than $T_s$, the dynamics of the relative frequencies of different types can be determined from their values of the net reproduction ratio R; on longer time scales, other mechanisms are essential to the dynamics, such as the feedback between the population and the environment. In general, when a type has a short-term advantage ($R(p) > R(p_{es})$), $T_s(p)$ is a quantitative measure of the time scale in which instantaneous change in frequency dominates the evolutionary dynamics for that type.

Finally, one can measure the long-term invasibility by a particular strain. Define the limiting invasion fitness $F_l(p)$ of type p to be $\lim_{T \to \infty} F(T, p)$. Similarly to R averaged over time, $F_l$ peaks at the evolutionarily stable value. However, considering fitness to be a function of time, rather than a single number, allows one to characterize the time-inhomogeneous nature of evolutionary systems in which short-term and long-term fitness are different.

Because some of the individuals in the population can be of rapidly-reproducing types that have high short- or medium-term fitness but low long-term fitness, considering only the strain fitness F plus mutation-selection balance does not present a complete picture of the long-term composition of types in the population. Instead, the distribution of types P(p) in the population is related to F as follows:

$$P(p) = \frac{\mu(p) \int_0^\infty F(T, p)\, dT}{n_{es} + \sum_{p' \neq p_{es}} \mu(p') \int_0^\infty F(T, p')\, dT}$$

where $\mu(p)$ is the rate of initial mutants of type p arising, and $n_{es}$ is the average number of individuals of type $p_{es}$ present; $n_{es}$ must be measured directly since the integral of $F(T, p_{es})$ is infinite. P(p) measured numerically agrees with the above (except for types which are within 0.1 of the evolutionarily stable type, since these strains take a long time to decline and were not tracked longer than 1000 generations).
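The time-scale measures defined in this section can be evaluated from a table of average strain populations. The sketch below is illustrative only and makes several assumptions that are not in the text: F is stored as a NumPy array with one row per tabulated time and one column per type, the tabulated times are positive, and the function names are hypothetical. On a finite table, the value returned for the selection time scale is only a lower bound for any type that is still ahead of the evolutionarily stable type at the last tabulated time.

```python
import numpy as np


def normalized_ratio(F, T):
    """R(T, p) = F(T, p)**(1/T); F has shape (len(T), n_types), all T > 0."""
    return F ** (1.0 / T[:, None])


def most_successful_type(F, T, types):
    """p_opt(T): for each tabulated T, the type that maximizes R(T, p)."""
    return types[np.argmax(normalized_ratio(F, T), axis=1)]


def selection_time_scale(F, T, types, p_es):
    """T_s(p): smallest T such that F(t, p) < F(t, p_es) for all tabulated t > T.

    Returns 0 for types that are behind p_es at every tabulated time,
    and NaN for p_es itself, for which T_s is not defined.
    """
    es = int(np.argmin(np.abs(types - p_es)))   # column of the stable type
    not_behind = F >= F[:, [es]]                # True where p is not yet behind p_es
    T_s = np.zeros(len(types))
    for j in range(len(types)):
        idx = np.flatnonzero(not_behind[:, j])
        T_s[j] = T[idx[-1]] if idx.size else 0.0
    T_s[es] = np.nan
    return T_s
```

With F measured as in Figs. 8 and 9, `most_successful_type` would correspond to the curve in figure 11a and `selection_time_scale` to the curve in figure 11b.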

Relationship of pathogen type and spatial structure

To understand the lineage history of strains, it is helpful to examine the relationship of pathogen phenotype and the local environment of susceptible hosts. Fig. 12 shows a mutant strain 50 generations after it arose, with a value of r that is significantly above the evolutionarily stable type. This strain has arisen from a single ancestor at time $t_0$ which mutated from a lower value of r. By time $t_0 + 250$, the strain has become extinct. The figure suggests that the local environment is significantly altered by the mutant type. Fig. 13 shows enlarged views of two panels of Fig. 1, where it can be seen that the local configuration of susceptibles that an average pathogen finds itself in changes with r. Thus, the characteristics of the pathogen shape the host patches that they find themselves in. While a complete characterization of the local environment is difficult, we can consider the local density of susceptible hosts adjacent to an infected host as a first approximation. Using this measure, the effect of the pathogen phenotype on the local environment of susceptible hosts is apparent in Fig. 1, where only a single type is present in any one simulation. Fig. 14 shows that the characteristic length scale of host patches also changes with r.
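A characteristic patch length of this kind can be estimated from a lattice snapshot along the following lines. This is a rough sketch under assumptions that are not stated in the text: the snapshot is a 2-D array marking susceptible hosts with 1 and everything else with 0, the lattice has periodic boundaries, correlations are measured only along the two lattice axes, and the snapshot is not uniform (so the variance is nonzero). The 1/e threshold follows the caption of Fig. 14; the function names are hypothetical.

```python
import numpy as np


def autocorrelation(hosts, max_d):
    """c(d) for d = 0..max_d, averaged over both lattice axes (periodic lattice)."""
    x = hosts.astype(float)
    x -= x.mean()                      # work with fluctuations about the mean density
    var = (x * x).mean()               # assumed nonzero (snapshot is not uniform)
    c = np.empty(max_d + 1)
    for d in range(max_d + 1):
        along_rows = (x * np.roll(x, d, axis=0)).mean()
        along_cols = (x * np.roll(x, d, axis=1)).mean()
        c[d] = 0.5 * (along_rows + along_cols) / var
    return c


def correlation_length(c):
    """First distance at which c(d) has dropped to 1/e or below, else None."""
    below = np.flatnonzero(c <= 1.0 / np.e)
    return int(below[0]) if below.size else None
```

Averaging c(d) over many snapshots before taking the threshold would reduce the noise in the estimated length.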

[Figure 12 image: lattice snapshot with a color legend for r running from 0 to 1.]

Figure 12: A snapshot of the model with mutation, with r shown as color as indicated in the legend. Yellow represents a high-transmissibility (r = 0.9) mutant strain which arose 50 generations ago. Hosts are shown as dark green. The lattice size L is 175. We see that the mutant strain is spatially clustered and is depleting the hosts from its local environment. This environmental change leads to the eventual extinction of the strain.

[Figure 13 image: two magnified lattice panels, (a) and (b).]

Figure 13: Magnification of the model with no mutation, showing an example of how transmissibility governs the local spatial structure of susceptible hosts the pathogens find themselves in. (a) The transmissibility r is 0.2. (b) r is 0.45.

[Figure 14 image: spatial auto-correlation c(d) (vertical axis, 0 to 0.8) against distance d (horizontal axis, 0 to 100), with curves for r = 0.15, 0.25, 0.35, 0.40 and 0.45.]

Figure 14: Spatial auto-correlation c(d) in healthy hosts as a function of distance d. The decrease with distance shows the characteristic length scale of host patches. The inset shows the characteristic length scale l of the patches as a function of r, the length at which the correlation drops to 1/e. (Note, however, that the extinction of the parasite for high r is not necessarily due to the length scale approaching the size of the lattice; high-r populations can eventually go extinct even in an infinite system.)

[Figure 15 image: contact rate p (vertical axis) against time T since the strain arose (horizontal axis), with one curve per transmissibility.]

Figure 15: The contact rate, p (the number of neighboring susceptible hosts), averaged over all individuals infected with a strain of a particular type, as a function of time since the strain first arose. Within 40 generations, the local environment in the vicinity of the strain has been changed from the value characteristic of the evolutionarily stable type to a value characteristic of the mutant strain. This characteristic value is plotted in fig. 16. Parameters are as in fig. 8.

Strains that arise by mutation are generally located in an area whose local environment has been determined by the strain they mutated from. We find, however, that after the first mutant arises, the new strain changes the local environment, measured by the local density, to the environment that is characteristic of it. Fig. 15 shows the local contact rate of susceptible hosts as a function of the time since the strain arose, where the change can be seen to take about 40 generations. Fig. 16 shows an average over time of the local contact rate for the evolving system and compares it with the system with only one type. We see that, for all values of r, the local contact rate for mixed systems (with mutation) is the same as that for homogeneous systems (without mutation), even though in the mixed system, many strains exist on the same lattice and individuals are constantly mutating.
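A measurement of the local contact rate could look like the sketch below. It is a minimal illustration, not the code used for Figs. 15 and 16, and it rests on assumptions introduced here: the integer state encoding, a four-neighbor neighborhood, periodic boundaries, and the optional boolean `strain_mask` selecting the infected sites that belong to one strain are all hypothetical.

```python
import numpy as np

EMPTY, SUSCEPTIBLE, INFECTED = 0, 1, 2   # hypothetical state encoding


def contact_rate(state, strain_mask=None):
    """Mean number of susceptible nearest neighbors per infected site.

    `state` is a 2-D integer lattice (periodic boundaries assumed);
    `strain_mask`, if given, restricts the average to one strain's sites.
    Returns NaN when no infected site is selected.
    """
    sus = (state == SUSCEPTIBLE).astype(int)
    # Count susceptible hosts among the four nearest neighbors of every site.
    neighbors = sum(np.roll(sus, shift, axis=axis)
                    for axis in (0, 1) for shift in (1, -1))
    infected = state == INFECTED
    if strain_mask is not None:
        infected = infected & strain_mask
    return float(neighbors[infected].mean()) if infected.any() else float("nan")
```

Evaluating this once per generation for the sites of a single strain, and averaging over strains of the same type, would give a curve of the kind described for Fig. 15. Dividing by four converts the count into a local density.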

[Figure 16 image: contact rate p (vertical axis, 0.1 to 0.3) against transmissibility r (horizontal axis, 0.2 to 1).]

Figure 16: The contact rate, p, as a function of transmissibility r. Squares represent data measured in nonmutating populations where all pathogens are of the same type, and circles represent data taken in evolving populations, where many other strains with different r are present. Parameters are as in fig. 8. (Data for high r in homogeneous systems are more variable since the pathogen drives the host to extinction and hence a shorter time series is available.)

Spatial and genealogical structure

In order for the strains to systematically modify the local spatial structure of their environment, we expect that they are at least partially spatially segregated. To study this directly, we show in fig. 17 a representation of the spatial structure of genealogical distance. In this picture, the colors show the degree of genealogical relatedness to a particular pathogen. The left and right panels show this for two different individuals at the same time in an evolving population, simulated using incremental mutations. Fig. 18 plots the average genealogical distance as a function of physical distance between the two individuals in space, averaged over all pairs at that physical distance. The genealogical distance is small for short distances, reflecting the likelihood that nearby individuals are genetically related because of the locality of reproduction. It increases for longer spatial distances, indicating that strains are physically clustered on short and medium time scales. We note that physical distance remains small for genealogical distances of hundreds of generations. This implies that the typical lifespan of mutant strains (200 generations) is small enough so that strains go extinct before spreading throughout the space.
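The genealogical distance between two individuals can be read off a recorded genealogy by walking up the parent links. The sketch below is one possible way to do this and is not taken from the simulations: the dictionaries `parent` (individual to parent, with None for founders) and `born` (individual to the generation in which its infection began) are hypothetical data structures, and the distance is measured from the birth of the most recent common ancestor, which is itself an assumption.

```python
def coalescence_time(a, b, parent, born, now):
    """Generations between `now` and the birth of the most recent common
    ancestor of individuals `a` and `b`; None if no common ancestor is recorded."""
    # Collect a and all of its recorded ancestors.
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    # Walk up from b until a member of that set is reached.
    node = b
    while node is not None and node not in ancestors:
        node = parent.get(node)
    if node is None:
        return None
    return now - born[node]
```

Averaging this quantity over all pairs of living pathogens at a given lattice separation would give a curve of the kind plotted in Fig. 18.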

Conclusion

We have shown that the evolutionary dynamics of a generic host-pathogen model can be understood by characterizing the reproductive success of strains on different time scales. Neither spatially averaged properties nor time-averaged local properties can reveal the mechanisms responsible for the long-term composition of the population. Characterizing the strain dynamics is made possible by tracing the genealogical tree of the population. In particular, we have shown that: (a) the time-averaged reproduction ratio has a maximum at intermediate transmissibilities, but the reproduction ratio of mutants when they first arise increases with transmissibility; (b) the lineage history of high-transmissibility strains shows that they grow faster than intermediate-transmissibility ones for a significant length of time before declining and going extinct; (c) the characteristics of the pathogen determine the spatial distribution of the host, and hence the environment that the pathogens find themselves in. Strains that reproduce and grow over several generations change the local environment over time, and their local effect is the same whether or not there are other types on the lattice; (d) the genealogical distance between pathogens is correlated with their spatial distance. In summary, we find that high-transmissibility strains change their environment in a way that is ultimately detrimental to their survival. However, there is a significant

[Figure 17 image: two lattice panels with a color scale of genealogical distance running from 0 to 1,380 generations.]

Figure 17: Genealogical distance between individuals in space. Distance from the individual marked by the arrow is shown as color. Yellow indicates pathogens that have a recent ancestor in common with the pathogen indicated by an arrow; red represents ones that have the most distant common ancestor (a distance of 1,380 generations). Pathogens of the same color are not necessarily related to each other. The two plots show relatedness from two different individuals at the same time.

[Figure 18 image: coalescence time T (vertical axis, 0 to 1000 generations) against distance d (horizontal axis, 0 to 70).]

Figure 18: The average number of generations since a pair of individuals had their most recent common ancestor (coalescence time), as a function of their distance from each other. Since the size of the system being simulated is only 100 x 100, the levelling off of the curve may be due to the finite system size. Other parameters are as in fig. 5.

time delay before this change leads to their extinction. During this time, these strains take advantage of the host spatial structure generated by the evolutionarily stable type and are able to propagate rapidly before going extinct. Since new mutant strains arise by mutation, it is possible for strains which, by themselves, would drive the host to extinction, to be continually present in a population if mutations are frequent enough. In systems like the one studied, reproductive success must be thought of as a function of time. The composition of types of this, and, we believe, many natural systems, can be understood as a mixture of types, each of which is successful on a particular time scale.

References

[1] Rauch, E.M., Bar-Yam, Y. & Sayama, H. (2003). Dynamics and genealogy of strains in spatially extended host-pathogen models. J. Theor. Biol. 221, 655-664.

[2] Anderson, R.M. & May, R.M. (1982). Coevolution of hosts and parasites. Parasitology 85, 411-426.

[3] Levin, S.A., Grenfell, B., Hastings, A. & Perelson, A.S. (1997). Mathematical and computational challenges in population biology and ecosystems science. Science 275, 334-343.

[4] Schrag, S.J. & Perrot, V. (1996). Reducing antibiotic resistance. Nature 381, 120-121.

[5] Fenner, F. (1983). Biological control as exemplified by smallpox eradication and myxomatosis. Proc. R. Soc. Lond. B Biol. Sci. 218, 259-285.

[6] Pielou, E.C. (1974). Biogeographic range comparisons and evidence of geographic variation in host-parasite relations. Ecology 55, 1359-1367.

[7] Kareiva, P. (1990). Population dynamics in spatially complex environments: theory and data. Phil. Trans. R. Soc. Lond. B Biol. Sci. 330, 175-190.

[8] Tilman, D. & Kareiva, P., eds. (1997). Spatial Ecology: The Role of Space in Population Dynamics and Interspecific Interactions. Princeton Univ. Press, Princeton, NJ.

[9] Mollison, D. (1977). Spatial contact models for ecological and epidemic spread. J. R. Stat. Soc. Ser. B 39, 283-326.

[10] Comins, H.N., Hassell, M.P. & May, R.M. (1992). The spatial dynamics of host-parasitoid systems. J. Anim. Ecol. 61, 735-748.

[11] Sayama, H., Kaufman, L. & Bar-Yam, Y. (2000). Symmetry breaking and coarsening in spatially distributed evolutionary processes including sexual reproduction and disruptive selection. Phys. Rev. E 62, 7065-7069.

[12] Savill, N.J. & Hogeweg, P. (1998). Spatially induced speciation prevents extinction: the evolution of dispersal distance in oscillatory predator-prey models. Proc. R. Soc. Lond. B Biol. Sci. 265, 25-32.

[13] Haraguchi, Y. & Sasaki, A. (2000). The evolution of parasite virulence and transmission rate in a spatially structured population. J. Theor. Biol. 203, 85-96.

[14] Boots, M. & Sasaki, A. (2000). The evolutionary dynamics of local infection and global reproduction in host-parasite interactions. Ecol. Lett. 3, 181-185.

[15] Sato, K., Matsuda, H. & Sasaki, A. (1994). Pathogen invasion and host extinction in lattice structured populations. J. Math. Biol. 32, 251-268.

[16] Rand, D.A., Keeling, M. & Wilson, H.B. (1995). Invasion, stability and evolution to criticality in spatially extended, artificial host-pathogen ecologies. Proc. R. Soc. Lond. B Biol. Sci. 259, 55-63.

[17] Andjel, E. & Schinazi, R. (1996). A complete convergence theorem for an epidemic model. J. Appl. Prob. 33, 741-748.

[18] Fisher, R.A. (1930). The Genetical Theory of Natural Selection. Oxford University Press, London.

[19] Brommer, J.E. (2000). The evolution of fitness in life-history theory. Biol. Rev. Cambridge Phil. Soc. 75, 377-404.

[20] May, R.M. & Anderson, R.M. (1983). Epidemiology and genetics in the coevolution of parasites and hosts. Proc. R. Soc. Lond. B Biol. Sci. 219, 281-313.

Chapter 5: Long-range interactions and evolutionary stability in a predator-prey system

Abstract

It is now becoming recognized that spatial distribution is crucial to many ecological and evolutionary processes [1]. At the same time, many natural systems have both local and long-range connections, a structure recently modelled as "Small-World networks" [2]. This raises the questions of whether spatial models are relevant and how spatial evolutionary systems are able to persist, since long-range interactions tend to counteract the effects of spatial separation. In this chapter, we use the predator-prey model of the previous chapter to illustrate the effect of long-range interactions on evolving systems. The evolutionary steady state in which predators are prevented from overexploiting their prey can be disrupted by the addition of long-range links. However, spatial behavior can remain even with a high density of long-range links. A parallel between this result and recent results from network theory is shown. The results suggest that the addition of long-range interactions may destabilize an evolutionary system, even if that system already contains a significant density of such interactions. Systems with larger spatial scale are more sensitive to the addition of long-range links.

Traditional approaches to biological modelling represent evolutionary and ecological systems using quantities averaged over space and time. These systems are represented as the frequencies of various genes or types in the population. In this formulation, the environment experienced by a particular individual is effectively the average over all environments. Such averaging is done on the principle that spatial fluctuations become unimportant in the limit of large populations. However, real populations are distributed spatially. Organisms are located at different points in space and thus experience different environments, and interactions between organisms are local. In recent times the importance of spatial inhomogeneity on the dynamics, stability, and diversity of ecological [1] and evolutionary [3] systems has begun to be recognized; this inhomogeneity may be crucial to the predictions of evolutionary and ecological models. A common theme of this work is that spatial distribution is necessary for the stability of evolutionary systems. A number of such systems are unstable when they are well-mixed (that is, when interactions are global rather than local) [4,5].

Many real spatially distributed systems, however, though interacting mostly locally, also have long-range interactions. In a biological context, long-range interactions can arise when individuals can disperse via spores or when their seeds are transported long distances, for example, or when invasive types are introduced from other locations by humans. In the limit of many long-range links, spatial models change to the behavior characteristic of well-mixed systems. The desire to investigate the effect of long-range connections on the connectivity of natural and artificial networks has led to the study of "Small-World" networks [2], which has shown that in fact it takes only a small amount of long-range interaction to make a mostly locally interacting system behave as a well-mixed system in some respects. The prevalence of long-range interactions thus calls into question the relevance of spatial models, and raises the question of how evolutionary systems that depend on spatial separation are able to persist.

The work on Small-World networks has emphasized the implications of long-range links for the diameter and average shortest path length of the network: a small fraction of long-range links (with most of the links remaining local) is enough to make these values scale as in a random network (in which there is no locality), and thus allows almost any node to be reached from any other by traversing a small number of links [2,6]. The diameter of the network is relevant to the dynamics of simple processes such as the spreading of epidemics [2,7,8] and diffusion [9], and thus the phenomenon of reduced diameter captures what is important about long-range links for the dynamics of these systems.
However, this is not necessarily the case for other types of dynamics on the network. Here we will study the effect of long-range links on the ability of predator and prey to coexist. Previous work has considered the evolutionary dynamics of dispersal distance in systems where this distance can evolve [10-14]. Ref. [10] showed that in a spatial predator-prey system exhibiting spiral waves, increasing dispersal distances are selected for and this may lead to the extinction of the population (though effects related to the boundary may prevent this). We will relate spatial evolution to recent work on networks, and show that the best-known "Small-World" phenomenon, the dramatic reduction in the diameter of the network that results from adding only a few links, is not the one that is relevant to the evolutionary dynamics of the predator-prey system. The stability of the predator-prey system is shown to depend on local spatial separation rather than global properties, and thus can be understood in light of the result [7,15,16] that the transition between spatial and non-spatial connectivity on a lattice as links are

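[Figure 1 image: probability of coexistence (vertical axis, ticks from 0.2 to 0.8) against lattice size L (horizontal axis, ticks at 100, 150 and 200).]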

Figure 1: Evolutionary stability as a function of lattice size L. Each point represents the probability $p_{es}$ that the predator and prey will coexist for at least 100,000 generations, obtained from 10 runs. There is a sharp transition at a size $L_{es}$. For the left curve (circles), the prey reproduction rate g = 0.2. For the right curve (x symbols), g = 0.05. The depletion rate v = 0.2 and the lattice size L = 250.

added is not a sudden transition, but takes the form of a crossover, with the system remaining spatial on length scales shorter than a crossover distance which depends on the density of long-range links.

Effect of long-range interactions

In the predator-prey model, evolutionary stability requires a minimum lattice size $L_{es}$. Below this size, the predator-prey system is evolutionarily unstable: a high-r strain eventually dominates the whole space and causes its own extinction (Fig. 1), thus causing the predator species to go extinct. $L_{es}$ is obviously related to the spatial structure of the predator and prey patches, but it is, in general, larger than the characteristic size of these patches, indicating that more than one patch is necessary to keep overexploiting mutants from dominating the system before going extinct.

With its dependence on spatial separation, the evolving predator-prey model thus provides a way to examine the effect of long-range interactions on spatial evolutionary processes. We introduce long-range connections into the predator-prey system by replacing the lattice of the model with a Small-World network [2]: we begin with a square lattice in which each site is connected to its four neighbors, and randomly rewire each link with probability p. Thus, there are $4pL^2$ long-range connections; a regular lattice is a special case with p = 0. We will first assume that the connections are fixed in order to make direct connections to work on Small-World networks. We will then also consider long-range connections that arise dynamically.

In the limit of p = 1, there are no local links; predators and prey can interact with others anywhere on the lattice, so the predator-prey system reduces to the spatially homogeneous behavior [4, 17] in which higher-r predator strains always outcompete lower-r ones. This occurs even when this leads to the extinction of the predator, as it does for a wide range of parameter values. However, since many real systems consist of mostly local interactions with some long-range ones, we are interested in the behavior of the system as p is increased from 0. The most basic question is under what conditions the system remains stable, that is, in what cases predator and prey can coexist. Figure 2 shows the probability of predator and prey coexisting over a long time scale (100,000 generations) as a function of p, showing that there is a sharp transition from coexistence to extinction at a value pc which depends on the parameters. For some parameters, pc can be quite high, corresponding to a small network diameter; that is, the system is stable even in part of the Small-World regime. For example, in figure 2a, as p is increased, the system becomes unstable only at a value of p at which the diameter of the network has been reduced to approximately that of a random network. Thus, the system remains in the spatially-extended regime despite long-range interaction and the Small-World effect. For the parameters in Fig. 2a, the value of pc is 0.45, so that the system is still stable when there is, on average, about one long-range connection attached to every site. The following table shows pc for a range of prey reproduction rates and depletion rates. At right is a measure of the characteristic size of the patches; the table shows the length l at which spatial auto-correlation drops to 1/e when there are no long-range links.

p_c   | g=.05  | g=.1  | g=.2        l_p   | g=.05 | g=.1 | g=.2
v=.1  | .02    | .3    | 1*          v=.1  | 9.8   | 2.9  | 2.6
v=.2  | .001   | .02   | .45         v=.2  | 4.8   | 3.3  | 2.6
v=.4  | .0001  | .002  | .3          v=.4  | 4.9   | 3.5  | 2.5

* indicates the system is stable for all p.

p_c is somewhat correlated with patch size (R^2 = 0.32). When patches are larger in the absence of long-range links, the system is more sensitive to the addition of links. l_p, however, is only a rough indication of the characteristic spatial scale of the system, since the addition of long-range links changes its spatial structure.

Figure 2: (a) Evolutionary stability on a two-dimensional Small-World network as a function of p. The plot shows the probability p_es that predator and prey will coexist for 100,000 generations, as a function of p, averaged over 11 runs. This is plotted for two prey reproduction rates: g = 0.05 (circles) and g = 0.2 (squares). There is a sudden transition to instability at a density p_c, defined as the density such that for all p > p_c, the probability of coexistence is less than 1. For comparison, the average path length l(p) between nodes is plotted as a fraction of the average path length l(0) for p = 0 (dashed line, same scale). The dotted line shows the value of l(p)/l(0) for a random network. Note the logarithmic scale of p. The depletion rate v = 0.2 and the lattice size L = 250. (b) The evolutionarily stable reproduction rate r_es as a function of p on a 2D Small-World network, averaged over the last 200 generations of 10 runs of 100,000 generations. r_es is also plotted for values of p for which the predators go extinct; these values are indicated with shading and the average of the last 200 generations before extinction is plotted. (c) As (b), but for g = 0.05.

For some biological scenarios, such as pathogens dispersing by spores, it is more realistic that only the predator disperses long distances, and does so dynamically instead of along fixed long-range connections. We model this as follows: p describes the probability that a predator (pathogen) will have a chance to disperse to a randomly chosen site, and the pathogen infects with probability r if that site contains a host (prey). This leads to similar results, although the value of p_c is increased:

p_c   | g=.05  | g=.1  | g=.2
v=.1  | .1     | .8    | 1*
v=.2  | .005   | .2    | 1*
v=.4  | .001   | .02   | .9

Insight into the nature of the transition to instability can be gained by examining the evolved reproduction rate r_es of the predators. Increasing the density of long-range interactions raises r_es. Figures 2b and 2c plot the evolved average value of the reproduction rate r_es as a function of p, showing that higher values of p are associated with higher r_es. This is consistent with exploiting predators being able to escape local extinction more easily through long-range links.

Evolutionary stability and Small-World phenomena

The work on Small-World networks has found that adding a small density of links to a locally connected network causes these networks to have an average path length and diameter that scale logarithmically with the number of nodes, like a random network, even though most links are nearest-neighbor links like those of a lattice. By contrast, on a lattice, the maximum distance between two nodes scales polynomially in the number of nodes N: in two dimensions, for example, it scales as sqrt(N).

Further work has elucidated the nature of this phenomenon and the transition to small-world behavior [7, 18]. The small-world phenomenon of shortening path lengths is actually effective only between nodes that are more than a certain crossover distance δ(p) from each other on the underlying lattice (that is, the distance on the lattice without the addition of long-range links). For nodes whose distance r on the underlying lattice is less than this crossover distance, the path length still scales linearly in r. Thus, a spatially extended network with long-range links retains its spatial connectedness in local areas, and only exhibits the small-world property between more distant nodes. δ depends on the link density p, and adding additional long-range links to the network shrinks the neighborhood in which "large-world" scaling of distances holds. Figure 3 plots the average path length l between nodes whose distance on the original lattice is r, showing that this length scales linearly for r < δ but is approximately constant for r > δ.

[Figure 3 plot: average path length l (vertical axis) against lattice distance r (horizontal axis, 0 to 160), with one curve each for p = 10^-4, p = 10^-3 and p = 10^-2.]

Figure 3: Average path length l as a function of distance r on the original lattice and density of long-range links p. The lattice size L = 250.
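The crossover between "large-world" and "small-world" scaling can be explored numerically. The following is a minimal sketch, not the code used to produce Figure 3: it builds a periodic square lattice, rewires each link with probability p by keeping one endpoint and choosing the other at random (one of several possible rewiring conventions, assumed here), and averages the breadth-first-search path length between pairs of sites grouped by their distance r on the unrewired lattice. The function names, the periodic boundary and the small default lattice size are illustrative assumptions.

```python
import random
from collections import deque


def small_world_lattice(L, p, rng):
    """Periodic L x L square lattice (L >= 3); each link is rewired with probability p."""
    n = L * L
    adj = [set() for _ in range(n)]
    for x in range(L):
        for y in range(L):
            a = x * L + y
            # Each site owns its "down" and "right" links, so every link is visited once.
            for b in (((x + 1) % L) * L + y, x * L + (y + 1) % L):
                if rng.random() < p:
                    # Rewire: keep endpoint a, draw a new endpoint uniformly at random.
                    b = rng.randrange(n)
                    while b == a or b in adj[a]:
                        b = rng.randrange(n)
                adj[a].add(b)
                adj[b].add(a)
    return adj


def lattice_distance(a, b, L):
    """Shortest distance between sites a and b on the unrewired periodic lattice."""
    dx = abs(a // L - b // L)
    dy = abs(a % L - b % L)
    return min(dx, L - dx) + min(dy, L - dy)


def bfs_distances(adj, source):
    """Network path length from source to every reachable site."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def mean_path_length_by_r(L, p, n_sources=20, seed=0):
    """Average network path length l for pairs at lattice distance r."""
    rng = random.Random(seed)
    adj = small_world_lattice(L, p, rng)
    totals, counts = {}, {}
    for source in rng.sample(range(L * L), n_sources):
        for target, d in bfs_distances(adj, source).items():
            if target == source:
                continue
            r = lattice_distance(source, target, L)
            totals[r] = totals.get(r, 0) + d
            counts[r] = counts.get(r, 0) + 1
    return {r: totals[r] / counts[r] for r in sorted(totals)}


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.1):
        curve = mean_path_length_by_r(L=30, p=p)
        print(p, [round(curve[r], 1) for r in (1, 5, 10, 20)])
```

With p = 0 the path length equals r exactly. For p > 0 one would expect the curve to follow r at short distances and flatten at larger ones, with the flattening setting in earlier as p grows.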

Discussion

The original work on Small-World networks emphasized the sudden nature of the transition to short path lengths between nodes. However, there are at least two "small world phenomena": the abrupt reduction in the diameter of the network, which is relevant to global properties of the system, such as the time for a spreading process to reach all sites; and the crossover between "large-world" and "small-world" behavior. The persistence of spatial behavior in the presence of long-range links suggests that the latter is more relevant to processes, such as the predator-prey dynamics, that occur at particular length scales. The predator-prey system can remain stable even when the network diameter has been reduced considerably, in some cases even to that of a random network, showing that the reduction in average path length is not the relevant phenomenon for understanding the effect of long-range interactions. The system undergoes a transition from stability to instability as p is increased and the crossover distance (or spatial neighborhood size) decreases, suggesting that characteristically spatial behavior can remain if the spatial neighborhood size is large enough to allow the necessary spatial separation.

The simulation results show an inverse relationship between the characteristic size of the patches and p_c. This, and the observation that p is a decreasing function of the crossover distance δ, suggest that the larger the spatial structure in the distribution of predator and prey, the smaller the density of long-range links it takes to destabilize the system.

The link density p_c at which the system becomes unstable has a corresponding spatial neighborhood size δ(p_c). This value is, in general, larger than the characteristic length scale of the patches. If the spatial size of the system is such that there is only a single patch on average, this patch can be taken over by overexploiting strains, causing the extinction of the system. There need to be several patches within the spatial neighborhood size in order for the overexploiting strains to go extinct.

The fact that the system can remain in the spatially-extended regime despite long-range connections suggests that evolutionary systems that depend on spatial separation can be robust to the presence of long-range interactions. Accordingly, spatially extended models can be appropriate to such systems even when they have long-range interactions, as long as the density of links is not too high. There are many other processes in spatially extended biological systems that operate on particular length scales, and such systems may, like the predator-prey system, require spatial separation only in a limited spatial neighborhood for their characteristically spatial behavior. Other possible systems include the formation of patterns in ecosystems by short-range activation and longer-range inhibition [19] and pattern formation in excitable media, such as spiral waves [20].

The transition from spatial to homogeneous behavior can be sudden, however. In the predator-prey system, this transition takes the form of the extinction of the predator when enough links are added. As shown in Fig. 2, adding a few links at the critical value of p allows an overexploiting predator strain to dominate the system before depletion of prey causes its extinction, leading to the extinction of the predator species (and in some cases the prey as well). Thus, the effect of increasing long-range interactions in a system depends on how close it is to the transition, but systems with larger spatial structure are more sensitive to destabilization by increasing long-range interactions.

References

[1] Tilman, D. & Kareiva, P., eds. (1997). Spatial Ecology: The Role of Space in Population Dynamics and Interspecific Interactions. Princeton Univ. Press, Princeton, NJ.

[2] Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of "small-world" networks. Nature 393, 440-442.

[3] Sayama, H., Kaufman, L. & Bar-Yam, Y. (2000). Symmetry breaking and coarsening in spatially distributed evolutionary processes including sexual reproduction and disruptive selection. Phys. Rev. E 62, 7065-7069.

[4] Rand, D.A., Keeling, M. & Wilson, H.B. (1995). Invasion, stability and evolution to criticality in spatially extended, artificial host-pathogen ecologies. Proc. R. Soc. Lond. B Biol. Sci. 259, 55-63.

[5] Rauch, E.M., Sayama, H. & Bar-Yam, Y. (2002). Dynamics and genealogy of strains in spatially extended host-pathogen models. J. Theor. Biol. 221, 655-664.

[6] Newman, M.E.J., Moore, C. & Watts, D.J. (2000). Mean-field solution of the small-world network model. Phys. Rev. Lett. 84, 3201-3204.

[7] Moukarzel, C.F. (1999). Spreading and shortest paths in systems with sparse long-range connections. Phys. Rev. E 60, R6263.

[8] Moore, C. & Newman, M.E.J. (2000). Epidemics and percolation in small-world networks. Phys. Rev. E 61, 5678-5682.

[9] Jespersen, S. & Blumen, A. (2000). Small-World networks: Links with long-tailed distributions. Phys. Rev. E 62, 6270-6274.

[10] Savill, N.J. & Hogeweg, P. (1998). Spatially induced speciation prevents extinction: the evolution of dispersal distance in oscillatory predator-prey models. Proc. Roy. Soc. Lond. B 265, 25-32.

[11] Savill, N.J. & Hogeweg, P. (1999). Competition and dispersal in predator-prey waves. Theoretical Population Biology 56, 243-263.

[12] Koella, J.C. (2000). The spatial spread of altruism versus the evolutionary response of egoists. Proc. Roy. Soc. Lond. B 267, 1979-1985.

[13] Murrell, D.J., Travis, J.M.J. & Dytham, C. (2002). The evolution of dispersal distance in spatially-structured populations. Oikos 97, 229-236.

[14] Rousset, F. & Gandon, S. (2002). Evolution of the distribution of dispersal distance under distance-dependent cost of dispersal. Jour. Evol. Biol. 15, 515-523.

[15] Barthelemy, M. & Amaral, L.A.N. (1999). Small-World Networks: Evidence for a Crossover Picture. Phys. Rev. Lett. 82, 3180-3183.

[16] Moukarzel, C.F. & Argollo de Menezes, M. (2002). Shortest paths on systems with power-law distributed long-range connections. Phys. Rev. E 65, 056709.

[17] de Aguiar, M.A.M., Rauch, E.M. & Bar-Yam, Y. (2003). On the mean field approximation to a spatial host-pathogen model. Phys. Rev. E 67, 047102.

[18] Sen, P. & Chakrabarti, B. (2001). Small-world phenomena and the statistics of linear polymers. J. Phys. A 34, 7749.

[19] Hutson, V. & Vickers, G.T. (2000). Reaction-diffusion models. In The Geometry of Ecological Interactions, U. Dieckmann, R. Law and J.A.J. Metz, eds. Cambridge University Press, Cambridge, pp. 461-486.

[20] Meron, E. (1992). Pattern formation in excitable media. Phys. Rep. 218, 1.

Chapter 6: Related work

Abstract

The prior work which is most relevant to this thesis is identified in the introductions to Chapters 2, 4, and 5. Here we describe and briefly give key results from these and additional related works.

Scaling and dynamics of diversity

Genetic structure in spatial populations. The theoretical study of genetic structure in populations dates to the work of Wright [1,2] and Malecot [3,4]. Their work on "isolation by distance" models the dependence of genetic distance on spatial distance in populations with spatial structure. More recent work on this topic is reviewed in Ref. [9].

More recently, the genetic structure of populations has been studied through properties of the genealogical tree of a population, using coalescent theory and related methods [5-10]. These works study, for example, the time to the most recent common ancestor and the time between successive coalescences. In a well-mixed population, the time T_A until the most recent common ancestor scales as the population size N [5]. These works, however, have not generally focused on the scaling of diversity in populations or its distribution within a population. Coalescent methods normally assume a well-mixed population, but have been extended to models that include structure in the population.

Approaches to modelling spatial structure in these related works have generally been of two types: island models and stepping stone models. Island models contain an infinite number of panmictic populations with migration between them; individuals migrate at some rate to other populations chosen at random [2, 11]. Thus they model the effect of separation but not spatial extent. A major reason for the use of this model is analytic tractability. Ref. [12] gives results on genetic structure using this model. The time to common ancestor is also linear in population size in the island model [11, 13]. The stepping stone model [14] consists of a group of well-mixed populations connected in a spatially structured way, with nearby populations interacting weakly via migration. The lattice model used in this work is a fine-grained version of the stepping stone model. Refs. [15, 16] give results for genetic distance as a function of spatial distance in this model. The distribution c(d, t) of coalescence times t for two individuals separated by a distance d has been determined in a two-dimensional stepping stone model [17].
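The linear scaling of the time to the most recent common ancestor with population size in a well-mixed population can be checked with a short simulation. The sketch below assumes a haploid Wright-Fisher population of constant size N with non-overlapping generations, in which every individual picks its parent uniformly at random from the previous generation; these modelling choices and the function names are illustrative assumptions rather than the model of any particular cited work.

```python
import random


def time_to_mrca(N, rng):
    """Generations back until all N individuals share a single ancestor."""
    lineages = N          # number of distinct ancestral lineages still present
    generations = 0
    while lineages > 1:
        # Each lineage picks its parent uniformly among the N individuals of
        # the previous generation; lineages with the same parent coalesce.
        parents = {rng.randrange(N) for _ in range(lineages)}
        lineages = len(parents)
        generations += 1
    return generations


def mean_time_to_mrca(N, runs=200, seed=1):
    """Average time to the most recent common ancestor over independent runs."""
    rng = random.Random(seed)
    return sum(time_to_mrca(N, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    for N in (25, 50, 100, 200):
        print(N, mean_time_to_mrca(N))
```

Under these assumptions the mean is close to 2N generations, so doubling N should roughly double the printed value.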

Total diversity. The diversity of alleles at a particular locus was studied in models of both well-mixed and spatial populations [18-21]. However, the allelic diversity of these studies is not the same as genetic diversity, since, like species diversity, it treats all non-identical individuals as equally distinct, and does not consider the differing degrees to which individuals can diverge from each other. Thus, the statistical methods that were used to study allelic diversity are not readily generalizable to genetic diversity. From results in Ref. [22], the scaling of diversity (number of segregating sites) can be obtained: it scales as N log(N), where N is the population size. For a well-mixed, sexually reproducing population, the number of polymorphic sites was approximated as S ~ Nu, where u is the mutation rate [23]. These results have also been found, using coalescent theory, to hold for the island model under a range of conditions [11, 24-26].

The diversity-area relationships given in this thesis are concerned with the scaling of diversity with the size of the whole population. A separate question is how the number of segregating sites scales with sample size when the population is of fixed size; this scaling has been found to be logarithmic in panmictic populations [22] and the island model [12, 15]. This logarithmic relationship was found via simulations to hold for the spatial model used in this thesis.

The quasi-species model [27] has been studied in the physics literature. The work on this model considers the dynamics of mutation and reproduction of sequences directly. The population is assumed to be very large so the dynamics can be described using ordinary differential equations. This work considers questions such as the time to convergence to the optimum on simple fitness landscapes.

Diversity as measured by the number of species has been modelled theoretically [28,29]. However, as in the work on allelic diversity, these works treat all non-identical individuals as equally distinct, and do not consider the differing degrees of divergence.
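The two scalings quoted above for segregating sites (N log(N) for the whole population, logarithmic in sample size for a fixed population) both follow from the standard expectation for a neutral, well-mixed population, E[S] = θ (1 + 1/2 + ... + 1/(n-1)) for a sample of n individuals. The sketch below evaluates it, assuming the haploid convention θ = 2Nu; the function names and parameter values are illustrative assumptions.

```python
def watterson_a(n):
    """a_n = 1 + 1/2 + ... + 1/(n-1), which grows like log(n)."""
    return sum(1.0 / i for i in range(1, n))


def expected_segregating_sites(n, N, u):
    """Expected number of segregating sites in a sample of n from a population of N.

    Assumes a neutral, well-mixed haploid population with mutation rate u
    per individual per generation, so that theta = 2 * N * u.
    """
    theta = 2 * N * u
    return theta * watterson_a(n)


if __name__ == "__main__":
    N, u = 10_000, 1e-3
    # Fixed population, growing sample: logarithmic growth in n.
    for n in (10, 100, 1000, 10_000):
        print("sample", n, round(expected_segregating_sites(n, N, u), 1))
    # Whole population sampled (n = N): growth like N log(N).
    for size in (100, 1000, 10_000):
        print("population", size, round(expected_segregating_sites(size, size, u), 1))
```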

Distribution of diversity and fluctuations. A result related to our finding on the distribution of uniqueness is found in Ref. [30]. The line of descent from an individual to its most recent common ancestor with any other individual is referred to as an external branch. The average length of the external branches in a panmictic population was calculated to be linear in population size, but that work did not calculate the distribution. Ref. [31] found that the time to the most recent common ancestor can exhibit large discontinuous jumps.

Measures of diversity. There are three commonly used measures of within-species diversity: number of segregating sites, Wright's F_ST statistic, and nucleotide diversity. The number of segregating sites [22] is a common measure of genetic diversity, and the one used in most of the results in this paper. It is defined as the number of loci at which more than one allele is represented in the population.

Wright's F_ST statistic is often used in experimental studies. It measures the degree of differentiation between subpopulations relative to the diversity of the whole population. It is defined [4] as F_ST = (K_0 - K_1)/(1 - K_1), where K_0 is the probability of identity by descent for two alleles drawn at random from within a single (panmictic) population, and K_1 is the probability of identity by descent for two alleles drawn at random from different (separated) populations, averaged over all pairs of populations. For the island model, Wright found that F_ST ≈ 1/(4nm + 1), where n is the size of each panmictic population and m is the migration rate. For the stepping stone model, F_ST has been calculated to grow only weakly (as log(N), where N is the number of panmictic populations [32]). As discussed in Chapter 2, F_ST is very dependent on the choice of the division of the population into subpopulations, and may miss much of the existing diversity because it averages over pairs of populations. Many populations are not divided into discrete subpopulations but nevertheless exhibit structure (such as the lattice model in this thesis). Furthermore, F_ST does not make sense when the population is finely divided into small subpopulations, because K_0 tends to 1 and hence F_ST tends to 1, whatever the makeup of the population.

Another measure of diversity, nucleotide diversity [33], is defined as the average genealogical distance between all pairs of individuals in the population. It is similar to segregating sites, but weights a mutation more heavily if it is represented in more individuals in the population.
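The three measures can be made concrete on a toy data set. The sketch below is illustrative only: it represents individuals as equal-length strings, uses the number of pairwise sequence differences as a stand-in for genealogical distance, and takes the within-population and between-population identity probabilities as given inputs. All function names and the example sequences are hypothetical.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Number of positions at which more than one allele is present."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def f_st(k_within, k_between):
    """F_ST from identity-by-descent probabilities within and between populations."""
    return (k_within - k_between) / (1 - k_between)


def f_st_island(n, m):
    """Wright's island-model approximation, population size n and migration rate m."""
    return 1 / (4 * n * m + 1)


if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT", "CATT"]
    print(segregating_sites(sample))       # three of the four positions vary
    print(nucleotide_diversity(sample))    # 10 differences over 6 pairs
    print(f_st(0.5, 0.2), f_st_island(100, 0.01))
```

In this sample each of the three variable positions counts once toward the number of segregating sites, but the position where two of the four individuals carry each allele contributes four of the ten pairwise differences, while the two positions where a single individual differs contribute three each.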

Further connections. The dynamics of reproduction in the diversity model used in this work is the same as the voter model [34]. In this model, each site on a lattice has a state, and at every time step, each site adopts the state of a randomly chosen neighbor. Similar processes have more generally been studied in the physics literature under the name of coarsening. In the diversity model used in this thesis, the dynamics of any subset of the population, including any set of descendants of a single ancestor, follows the coarsening dynamics of the voter model. However, it includes the additional feature, not found in the voter model, of mutation, which allows diversity to increase as well as decrease.

Spatial tree processes with similarities to genealogical trees, called branching Brownian motion and branching diffusions, have been studied [35]. In these models, the branching is viewed as happening forward in time, in contrast to coalescent models where time is considered from the present backwards. However, in these processes, the branches take place without regard to the local density of individuals present, so they lack any interaction between the individuals.

It was shown in Chapter 2 that populations evolve to a state where their diversity changes intermittently, increasing smoothly but decreasing in jumps with no characteristic size. This resembles critical point phenomena in physics, in which, when a particular parameter reaches a critical value, fluctuations or structures appear that have no characteristic size, but rather come in all sizes up to the size of the system. These fluctuations or structures have a power law distribution of sizes. In particular, the fluctuations in diversity resemble self-organized criticality [36], since there is no parameter that must be tuned in order to obtain critical behavior. Self-organized criticality models exhibit "avalanches", chains of activity, of all sizes up to the system size, which are triggered by small events. The loss of unique mutations in the lattice diversity model can be compared with these avalanches.
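The voter-model dynamics with mutation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code of the thesis: sites sit on a one-dimensional ring, all sites update synchronously, each site copies the state of its left or right neighbor with equal probability, and a state is the tuple of mutation identifiers accumulated along its line of descent. The names `voter_step`, `segregating_mutations` and `run` are hypothetical.

```python
import random


def voter_step(states, mu, rng, next_id):
    """One generation: every site copies a random neighbor, then may mutate."""
    n = len(states)
    new_states = []
    for i in range(n):
        inherited = states[(i + rng.choice((-1, 1))) % n]
        if rng.random() < mu:
            # A mutation adds an identifier never seen before to the lineage.
            inherited = inherited + (next_id,)
            next_id += 1
        new_states.append(inherited)
    return new_states, next_id


def segregating_mutations(states):
    """Mutations carried by some, but not all, individuals."""
    sets = [set(s) for s in states]
    return len(set.union(*sets) - set.intersection(*sets))


def run(n_sites=50, mu=0.05, generations=200, seed=0):
    """Diversity (segregating mutations) recorded after each generation."""
    rng = random.Random(seed)
    states = [() for _ in range(n_sites)]
    next_id = 0
    history = []
    for _ in range(generations):
        states, next_id = voter_step(states, mu, rng, next_id)
        history.append(segregating_mutations(states))
    return history


if __name__ == "__main__":
    print(run()[-20:])
```

With mu = 0 the step reduces to the plain voter model, in which the set of states present can only shrink; with mu > 0 new mutations raise diversity while the loss of lineages lowers it.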

Dynamics and genealogy of strains in spatially extended host-pathogen models

This thesis proposes a generalized fitness measure F(T, p) of a type p that takes the evolutionary time scale T into account. T is the number of generations since introduction of a mutant strain. The most common measures of fitness currently used are based on the instantaneous rate of change in frequency of a type (that is, they do not consider whether an organism's descendants may have a fitness that is systematically different from its own). Two related measures, R and r, are most often used to quantify reproductive success [37-39]. R, the net reproduction ratio, measures the expected number of surviving offspring produced by an organism over its lifetime. In a nonmutating population, R is equal to one. In a population of constant size with mutation, the maximum R may be greater than one to balance the mutant types which have a value of R less than one. r is the "Malthusian parameter" [39, 40], and measures the per capita instantaneous rate of increase of the population of a type per unit time. The two measures differ mainly in that R measures time in generations whereas r measures time irrespective of the length of a generation. F(1, p) is equivalent to the net reproduction ratio R for mutants of type p.
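The relation between the two conventional measures can be illustrated with a small conversion. The sketch below assumes the idealized case of non-overlapping generations of fixed length, for which a type that multiplies by R each generation grows at the per-unit-time rate r = ln(R) / (generation length); for overlapping generations this is only an approximation. The function name and parameter values are hypothetical.

```python
import math


def malthusian_parameter(R, generation_time):
    """Per-unit-time growth rate r implied by net reproduction ratio R.

    Assumes non-overlapping generations of length generation_time.
    """
    return math.log(R) / generation_time


if __name__ == "__main__":
    # The same R corresponds to a smaller r when generations are longer.
    for generation_time in (1.0, 2.0, 10.0):
        print(generation_time, malthusian_parameter(1.5, generation_time))
    # R = 1 (each individual exactly replaces itself) gives r = 0.
    print(malthusian_parameter(1.0, 5.0))
```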

The concept of invasibility is another approach to the question of what types will come to dominate a population. One considers a population dominated by a phenotype p and asks whether a mutant phenotype p' can invade. An evolutionarily stable strategy [41] is one for which no mutant can invade. Under the assumptions normally used, the evolutionarily stable strategy is the one that maximizes R [42].

The assumption that populations will be composed mainly of types with the highest number of offspring has been applied successfully to analyzing a number of biological systems [37, 43]. However, it only applies to systems where the instantaneous change in frequency is sufficient to determine the long-term composition of the population. Systems for which this condition does not hold cannot be analyzed by assuming that selection maximizes conventional measures of fitness; hence the need for a fitness measure which is a function of time scale.

References

[1] Wright, S. (1943). Isolation by distance. Genetics 28:114-138.

[2] Wright, S. (1968). Evolution and the Genetics of Populations, vol. 1: Genetic and Biometric Foundations. Univ. Chicago Press, Chicago.

[3] Malecot, G. (1948). Les mathematiques de l'heredite. Masson, Paris.

[4] Malecot, G. (1973). Isolation by distance. In: Genetic Structure of Populations (N.E. Morton, ed.), University of Hawaii Press, Honolulu.

[5] Hudson, R. R. Gene genealogies and the coalescent process. Oxford Surv. Evol. Biol. 7, 1-44 (1990).

[6] Hudson, R. R. (1991). Gene genealogies and the coalescent process. In Oxford Surveys in Evolutionary Biology (D. Futuyma and J. Antonovics, eds.), Vol. 7, 1-44.

[7] Donnelly, P. & Tavare, S. (1995). Coalescents and genealogical structure under neutrality. Ann. Rev. Genet. 29, 401-21.

[8] Barton, N.H. & Wilson, I. Genealogies and geography. Phil. Trans. Roy. Soc. B 349, 49-59 (1995).

[9] Epperson, B.K. (1993). Recent advances in correlation studies of spatial patterns of genetic variation. Evol. Biol. 27, 95-155.

[10] G. A. Hoelzer, J. Wallman, D.J. Melnick, J. Mol. Evol. 47, 21 (1998).

[11] Nei, M., and Takahata, N. (1993) Effective population size, genetic diversity, and coalescence time in subdivided populations. J. Mol. Evol. 37, 240-244.

[12] Notohara, M. The structured coalescent process with weak migration. J. Appl. Prob. 38, 1-17 (2001).

[13] Takahata, N. (1991). Genealogy of neutral genes and spreading of selected mutations in a geographically structured population. Genetics 129:585-595.

[14] Kimura, M. & Weiss, G. H. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics 49, 313 (1964).

[15] Wilkins, J. F. & Wakeley, J. The coalescent in a continuous, finite, linear population. Genetics 161, 873 (2002).

[16] Cox, J.T. & Geiger, J. (2000). The genealogy of a cluster in the multitype voter model. Annals of Probability 28: 1588-1619.

[17] Barton, N.H. and Wilson, I. (1995). Genealogies and geography. Philos. Trans. R. Soc. Lond. Ser B 349: 49-59.

[18] Maruyama, T. Analysis of population structure I. One-dimensional stepping stone models of finite length. Ann. Human Genet. Lond. 34, 201-219 (1970).

[19] Maruyama, T. Analysis of population structure II. Two-dimensional stepping stone models of finite length and other geographically structured populations. Ann. Human Genet. Lond. 35, 179-196 (1971).

[20] Maruyama, T. Distribution of gene frequencies in a geographically structured finite population. I. Distribution of neutral genes and of genes with small effect. Ann. Human Genet. Lond. 35, 411-423 (1972).

[21] Maruyama, T. Distribution of gene frequencies in a geographically structured finite population. II. Distribution of deleterious genes and of lethal genes. Ann. Human Genet. Lond. 35, 425-432 (1972).

[22] Watterson, G. A. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7, 256 (1975).

[23] Kimura, M. (1969). The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893-903.

[24] Li, W.-H. (1976). Distribution of nucleotide differences between two randomly chosen cistrons in a subdivided population: the finite island model. Theor. Pop. Biol. 10: 303-308.

[25] Slatkin, M., (1987). The average number of sites separating DNA sequences drawn from a subdivided population. Theor Pop. Biol. 32: 42-49.

[26] Nagylaki, T. (1998). The Expected Number of Heterozygous Sites in a Subdivided Population. Genetics 149: 1599-1604.

[27] Nowak, M. A. What is a Quasi-species? Trends in Ecology and Evolution 7, 118-121 (1992).

[28] Rosenzweig, M.L. Species Diversity in Space and Time (Cambridge Univ. Press, Cambridge, UK, 1995).

[29] Hubbell, S. P. The Unified Neutral Theory of Biodiversity and Biogeography (Princeton University Press, Princeton, 2001).

[30] Wakeley, J. & Takahashi, T. Gene genealogies when the sample size exceeds the effective size of the population. Mol. Biol. Evol. 20, 208 (2003).

[31] G. A. Watterson. Mutant substitutions at linked nucleotide sites. Adv. Appl. Prob. 14, 206 (1982).

[32] Cox, J.T. & Durrett, R. The stepping stone model: new formulas expose old myths. Ann. Appl. Prob. 12: 1348-1377 (2002).

[33] Nei, M. Molecular Evolutionary Genetics (New York: Columbia University Press, 1987).

[34] Liggett, T. M. (1999). Stochastic Interacting Systems: Contact, Voter and Exclusion Processes. Springer Verlag.

[35] Donnelly, P. & Kurtz, T.G. Particle representations for measure-valued population models. Ann. Prob. 27:166-205 (1999).

[36] Bak P., Tang C. & Wiesenfeld K. Self organized criticality. Phys. Rev. A. 38 , 364-374 (1988).

[37] J.E. Brommer, Biol. Rev. Camb. Phil. Soc. 75, 377 (2000).

[38] O. Diekmann, J.A.P. Heesterbeek, and J.A.J. Metz, J. Math. Biol. 28, 365 (1990).

[39] B. G. Murray, Oikos 44, 509 (1984).

[40] Fisher, R. A. The distribution of gene ratios for rare mutations. Proceedings of the Royal Society of Edinburgh, 50, 205-220 (1930).

[41] J. Maynard Smith, On Evolution (Edinburgh University Press, Edinburgh, 1972).

[42] B. Charlesworth, American Naturalist 107, 303 (1973); J. A. J. Metz, R. M. Nisbet and S. A. H. Geritz, Trends Ecol. & Evol. 7, 198 (1992); H. M. Taylor, R. S. Gourley and C. E. Lawrence, Theor. Pop. Biol. 5, 104 (1974).

[43] S. C. Stearns, The Evolution of Life Histories (Oxford University Press, Oxford, 1992).

Chapter 7: Conclusion and future work

Conclusion

I have used the method of modelling the evolving genealogical tree of a population to study patterns of diversity. Though I have paid particular attention to diversity within biological species, the method is general enough that it applies to a broader range of systems. The genealogical tree of a population can be modelled as a coalescing random walk. Each step in the random walk is another generation into the past. Sampled spatial populations can be modelled by simulating the ancestry of the samples: the random walk is initialized by placing random walkers in positions corresponding to the locations they were taken. Using this method, I showed that predictions from the model agree with experimental genetic data from Pseudomonas bacteria. A given habitat has a diversity capacity, and a population whose diversity is lower than this capacity, due to recent founding or disturbance of the population, will grow until it reaches this capacity. Initially the increase is rapid, but it slows down with time. The fraction F(t) of the diversity capacity reached by time t is v/t/A in one dimension, (log(t)/ log(A)) 2 in two dimensions, and log(t)/ log(A) in a well-mixed population, where A is the number of sites or area. Full diversity is recovered by time TA, the expected time to the most recent common ancestor of a population in a steady state. The total diversity of a population depends on the population structure - whether it is well-mixed or spatially distributed. The effect of limited dispersal on diversity can be characterized in a way that is independent of many details of the model by deter- mining the scaling of diversity with the number of sites. This also reveals the effect of habitat area on diversity. In two dimensions, diversity B(A) scales as A(log(A)) 2. It thus, grows somewhat faster than a well-mixed population whose diversity scales as A log(A), but by a relatively slowly increasing factor of log(A). 
In one dimension, $B(A)$ scales as $A^2$, significantly faster than area.

The spatial distribution of populations (that is, limited dispersal) also has a significant effect on the distribution of genetic distances between individuals. In spatial populations, this distribution is a power law. This is fundamentally different from the exponential distribution found in well-mixed populations.

Diversity is unevenly distributed within species. I introduced a way of quantifying the distribution of diversity: the distribution of uniqueness. This distribution is a power law: the probability $P(U > u)$ of an individual being more unique than $u$ decays as a negative power of $u$, implying that a disproportionate fraction of the diversity is concentrated in small sub-populations. This distribution holds even when the population is well-mixed. The distribution of diversity predicted from the model matches experimental results. Small groups are of such importance to overall population diversity that even without extrinsic perturbations, there are large fluctuations in diversity due to extinctions of these small groups. The distribution of losses of diversity within a single generation is also a power law, of the same form as that of uniqueness. Diversity in populations with local dispersal is geographically non-uniform, with sharp boundaries between distantly related individuals, even without extrinsic causes such as barriers or differing local conditions.

The results presented on the scaling of diversity with habitat area and its distribution in the population have important implications for the conservation of biodiversity. The uneven distribution of diversity implies that much existing diversity in a population may be missed when sampling it, and the most commonly used measure of diversity does not take these divergent groups into account. The results suggest identifying divergent groups within species and ensuring their survival as a strategy to conserve genetic diversity. The results also imply that habitat area plays a greater role in within-species diversity than it does in species diversity.
Observations show that biodiversity as measured by the number of species scales slowly with area (roughly as a small power of $A$), but the effect on within-species genetic diversity is much more dramatic. In two dimensions, losing half the area leads to a loss of more than half the diversity, and the effect is much larger in one-dimensional habitats, where a reduction in area by a factor of 2 causes a loss of 3/4 of the genetic diversity.
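The area-loss figures quoted above follow directly from the scaling forms for $B(A)$. The Python sketch below evaluates them. It rests on several assumptions that are not part of the thesis: the function names are illustrative, the habitat size of 10,000 sites is arbitrary, the unknown constant prefactors are taken to cancel in the ratio, and the asymptotic scaling forms are treated as exact at finite size.

```python
import math


def diversity_scaling(area, structure):
    """Diversity B(A) up to an unknown constant prefactor (assumed to cancel in ratios)."""
    if structure == "1d":
        return area ** 2
    if structure == "2d":
        return area * math.log(area) ** 2
    if structure == "well-mixed":
        return area * math.log(area)
    raise ValueError("structure must be '1d', '2d' or 'well-mixed'")


def fraction_lost(area, reduction_factor, structure):
    """Fraction of diversity lost when the area shrinks by reduction_factor."""
    reduced_area = area / reduction_factor
    if reduced_area <= 1:
        # The logarithmic forms are only meaningful for more than one site.
        raise ValueError("reduced area must exceed one site")
    before = diversity_scaling(area, structure)
    after = diversity_scaling(reduced_area, structure)
    return 1.0 - after / before


if __name__ == "__main__":
    # Illustrative habitat of 10,000 sites, halved in area.
    for structure in ("1d", "2d", "well-mixed"):
        print(structure, round(fraction_lost(10_000, 2, structure), 3))
```

In one dimension the loss is exactly 3/4 regardless of the starting area, because $(A/2)^2 / A^2 = 1/4$. In the two logarithmic cases the loss depends weakly on the starting area but stays above one half.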

I then showed an application of the method of tracing the genealogical tree of a population to a different question in evolution: how pathogens (or predators) evolve to coexist with their hosts (or prey). In a simple host-pathogen or predator-prey model, the rate at which pathogens infect hosts evolves; higher rates give a short-term advantage to pathogens since they can reproduce faster, but a rate that is too high leads to depletion of prey and hence the extinction of the predator. However, because types are spatially segregated from each other, the overexploiting types eventually cause their own local extinction while the evolutionarily stable type survives. Thus over the long term selection favors a sustainable level of predation.

Evolution in this system can be understood by considering the dynamics of strains. Reproductive success is systematically different over the lifetime of a strain: overexploiting types initially increase rapidly but selection acts against them on long time scales. This calls for a generalized measure of fitness which takes as arguments not only the organism's type but also the time scale.

The contrast between long-term and short-term fitness may be a property of other systems in which a population depends on, and can deplete, a resource that grows locally, and where reproduction is local. Understanding evolution in systems with a contrast between short- and long-term fitness requires one to look for mechanisms that allow phenotypes that have a short-term disadvantage to persist. I showed one possible mechanism for this: feedback between the environmental change caused by the organism and selection. In the model, the local reproduction and depletion of the resource (hosts) makes it possible for some types to change their environment locally in a way that is ultimately detrimental to their own survival.

The evolutionary stability of such a system can be disrupted by the addition of long-range interactions. The system may be stable for a significant density of such interactions (depending on the parameters), but there is a sudden transition to instability at a critical density. The larger the spatial scale of patches in the system, the more sensitive it is to the addition of long-range interactions.

Future work

The methods developed in this thesis are potentially applicable to other aspects of biodiversity, and perhaps to non-biological systems. Future work will extend them to study other questions relating to why some species and environments are highly diverse and others less so, and how we can characterize the diversity that exists.

Consequences of uneven distribution of diversity. I have shown that diversity is unevenly distributed in populations. There are many consequences of this that can be investigated. In particular, there are implications for efforts to experimentally characterize the diversity of populations. When even careful sampling from populations can miss distinctive sub-populations carrying much of the population's diversity, how can we assess diversity? Simulations may identify the experiments whose results would most improve this characterization. Understanding the scaling may make it possible to infer the relationship between sampled and true diversity. There may also be implications for conservation. The uneven distribution of diversity suggests the possibility of identifying particularly unique and thus valuable organisms, or small groups of organisms, whose preservation will preserve a substantial fraction of the diversity of their species.

Geographic history of a population. It is widely believed that the center of diversity of a population is often its center of origin, though this is not always the case. By further investigating spatial patterns of diversity, the work in this thesis can be extended to determine to what extent the center of origin of a population can be detected in the genetics of the present population.

Diversity in host-pathogen systems. Diversity plays an important part in the dynamics of host-pathogen systems, and there is the potential to model it using the methods I have developed. Outbreaks are often caused by the emergence of new strains whose genetic distinctness allows them to evade existing host immune responses. Recent studies, taking advantage of new high-resolution molecular fingerprinting techniques, have revealed high levels of genetic diversity in common human pathogens [1-6]. This has become increasingly relevant in epidemiology, as many recently emerged pathogens, such as HIV, are similarly characterized by a high level of diversity [7-9], a characteristic with important consequences for the epidemiology of the disease [10, 11] and for the design of potential vaccines [12]. There is the opportunity to combine the modelling of host-pathogen interactions with the modelling of their diversity. In particular, recent work [13] suggests that the diversity of HIV (unlike that of influenza, for example) has not yet reached a steady state, so modelling the change in HIV diversity may be an application of my proposed work on the dynamics of non-steady-state diversity in populations.

Additional theoretical results. The approach developed in this thesis of applying results from the study of random walkers might be extended by using results on systems of walkers with different kinds of interactions. For example, selection might be modelled with an attractive force between walkers.
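As a point of reference for such extensions, the baseline case of non-interacting walkers can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used in the thesis. It assumes a one-dimensional ring with one sampled individual per site, and a lazy walk in which each lineage stays put or moves one site left or right with equal probability per generation. The function name and these dispersal rules are hypothetical choices made for the example.

```python
import random


def coalescence_times(num_sites, rng):
    """Trace the ancestry of one sample per site on a ring, backwards in time.

    Each generation into the past, every lineage takes one lazy random-walk
    step (assumed rule: -1, 0 or +1 with equal probability). Lineages that
    land on the same site have found a common ancestor and merge.
    Returns one entry per merger, giving the generation at which it occurred.
    """
    lineages = set(range(num_sites))
    generation = 0
    merge_times = []
    while len(lineages) > 1:
        generation += 1
        moved = set()
        for position in lineages:
            step = rng.choice((-1, 0, 1))
            moved.add((position + step) % num_sites)
        # Every lineage lost in this step corresponds to one coalescence event.
        merge_times.extend([generation] * (len(lineages) - len(moved)))
        lineages = moved
    return merge_times


if __name__ == "__main__":
    times = coalescence_times(50, random.Random(1))
    # The last merger is the time to the most recent common ancestor.
    print("mergers:", len(times), "time to common ancestor:", times[-1])
```

An interaction between walkers, such as the attractive force suggested above, would enter by replacing the uniform choice of `step` with a rule that depends on the positions of the other lineages.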

Species diversity. In addition to within-species diversity, the approach presented in this thesis makes it possible to study new aspects of the problem of species diversity. The evolutionary history of life on Earth is dominated by the opposing forces of diversification and extinction, so understanding the dynamics of diversity is crucial to understanding evolutionary change. Models of species diversity can be developed based on the applicable features of my within-species model (lineages would represent species rather than individuals). In particular, my finding that within-species diversity undergoes large "extinctions" even without extrinsic perturbations, such as environmental changes or disease, should be applicable to species diversity as well, and I intend to investigate whether this has implications for macroevolutionary patterns. An important difference between the planned work and most existing models is that the latter ignore the accumulation of successive differences between species. In other words, if species A gives rise to a daughter species B, and B in turn gives rise to C, C will be more different from A than B is. This fact may, however, be crucial to patterns of diversity [14].

References

[1] Musser, J.M., Kapur, V., Szeto, J., Pan, X., Swanson, D.S. & Martin, D.R. Genetic diversity and relationships among Streptococcus pyogenes strains expressing serotype M1 protein - recent intercontinental spread of a subclone causing episodes of invasive disease. Infection And Immunity 63, 994-1003 (1995).

[2] Salama, N., Guillemin, K., McDaniel, T.K., Sherlock, G., Tompkins, L. & Falkow, S. A whole-genome microarray reveals genetic diversity among Helicobacter pylori strains. Proc. Natl. Acad. Sci. USA 97, 14668-14673 (2000).

[3] Blaser, M.J. & Berg, D.E. Helicobacter pylori genetic diversity and risk of human disease. Journal of Clinical Investigation 107, 767-773 (2001).

[4] Hinchliffe, S.J., Isherwood, K.E., Stabler, R.A. et al. Application of DNA microarrays to study the evolutionary genomics of Yersinia pestis and Yersinia pseudotuberculosis. Genome Res. 13, 2018.

[5] Boyd, E.F., Porwollik, S., Blackmer, F. & McClelland, M. Differences in gene content among Salmonella enterica serovar Typhi isolates. J. Clin. Microbiol. 41, 3823.

[6] Anjum, M.F., Lucchini, S., Thompson, A., Hinton, J.C.D. & Woodward, M.J. Comparative genomic indexing reveals the phylogenomics of Escherichia coli pathogens. Infection Immun. 71, 4674.

[7] Gao, F., Yue, L., Robertson, D.L., Hill, S.C., Hui, H.X., Biggar, R.J., Neequaye, A.E., Whelan, T.M., Ho, D.D., Shaw, G.M., Sharp, P.M. & Hahn, B.H. Genetic Diversity Of Human-Immunodeficiency-Virus Type-2 - Evidence For Distinct Sequence Subtypes With Differences In Virus Biology. Journal Of Virology 68, 7433-7447 (1994).

[8] McCutchan, F.E. Understanding the genetic diversity of HIV-1. AIDS 14, S31-S44 (2000).

[9] Korber, B., Gaschen, B., Yusim, K., Thakallapally, R., Kesmir, C. & Detours, V. Evolutionary and immunological implications of contemporary HIV-1 variation. Brit. Med. Bull. 58, 19-42 (2001).

[10] Peeters, M. & Sharp, P.M. Genetic diversity of HIV-1: the moving target. AIDS 14, S129-S140 (2000).

[11] Tatt, I.D., Barlow, K.L., Nicoll, A. & Clewley, J.P. The public health significance of HIV-1 subtypes. AIDS 15, S59-S71 (2001).

[12] Gaschen, B., Taylor, J., Yusim, K., Foley, B., Gao, F., Lang, D., Novitsky, V., Haynes, B., Hahn, B.H., Bhattacharya, T. & Korber, B. AIDS - Diversity considerations in HIV-1 vaccine selection. Science 296, 2354-2360 (2002).

[13] Korber, B., Muldoon, M., Theiler, J. et al. Timing the ancestor of the HIV-1 pandemic strains. Science 288, 1789 (2000).

[14] Faith, D.P. Phylogenetic pattern and the quantification of organismal biodiversity. Philos. Trans. R. Soc. Lond. Ser. B 345, 45-58 (1994).
