The Moran Process as a Markov Chain on Leaf-labeled Trees

David J. Aldous∗
Department of Statistics, University of California
367 Evans Hall # 3860, Berkeley CA 94720-3860
[email protected]
http://www.stat.berkeley.edu/users/aldous

March 29, 1999

Abstract. The Moran process in population genetics may be reinterpreted as a Markov chain on a set of trees with leaves labeled by [n]. It provides an interesting worked example in the theory of mixing times and coupling from the past for Markov chains. Mixing times are shown to be of order n², as anticipated by the theory surrounding Kingman's coalescent.

Incomplete draft -- do not circulate

AMS 1991 subject classification: 05C05, 60C05, 60J10

Key words and phrases. Coupling from the past, Markov chain, mixing time, phylogenetic tree.

∗Research supported by N.S.F. Grant DMS96-22859

1 Introduction

The study of mixing times for Markov chains on combinatorial sets has attracted considerable interest over the last ten years [3, 5, 7, 14, 16, 18]. This paper provides another worked example. We must at the outset admit that the mathematics is fairly straightforward, but we do find the example instructive. Its analysis provides a simple but not quite obvious illustration of coupling from the past, reminiscent of the elementary analysis ([1] section 4) of riffle shuffle, and of the analysis of move-to-front list algorithms [12] and move-to-root algorithms for maintaining search trees [9]. The main result, Proposition 2, implies that while most combinatorial chains exhibit the cut-off phenomenon [8], this particular example has the opposite diffusive behavior. Our precise motivation for studying this model was as a simpler version of certain Markov chains on phylogenetic trees: see section 3(b) for further discussion. The model also fits a general framework studied in [4]: see section 3(a).

1.1 The Moran chain

The Moran model ([11] section 3.3) in population genetics models a population of constant size n. At each step, one randomly-chosen individual is killed and another randomly-chosen individual gives birth to a child. The feature of interest is the genealogy of the individuals alive at a given time, that is, how they are related to each other by descent. In population genetics these "individuals" are in fact genes, and there is also mutation and selection structure, but our interest goes in a different direction. There is some flexibility in how much information we choose to record in the genealogy of the current population, and we will make the choice that seems simplest from the combinatorial viewpoint. Label the individuals as [n] := {1, 2, . . . , n}, and declare that if individual k is killed then the new child is given label k. The left diagram in figure 1 shows a possible genealogy, in which we keep track of the order of the times at which all the splits in all the lines of descent occurred, but not the absolute times of splits.

[Figure 1: three diagrams, with levels 0-7 marked on the vertical axis. Left: a tree t with leaf-set [7]. Center: the tree obtained from t by deleting leaf 7. Right: the tree t′ = f_7(t, 4, 7) obtained by re-inserting leaf 7 to the right of leaf 4.]

Figure 1. A transition t → t′ in the Moran chain.

Precisely, the left diagram shows a tree t with leaf-set [n] and height n, where at each level one downward edge splits into two downward edges, and where we distinguish between left and right branches. Such a tree has n(n + 1)/2 edges of unit length. Write T_n for the set of such trees. The cardinality of this set is

#T_n = n!(n − 1)!   (1)

We leave to the reader the (very easy) task of giving a bijective proof of (1); an inductive proof will fall out of the proof of Lemma 1 below. Interpreting the Moran model as a T_n-valued process gives a Markov chain on T_n which we call the Moran chain. Here is a careful definition. Take a tree t ∈ T_n and a distinct ordered pair (j, k) from [n]. Delete leaf k from t, then insert leaf k into the edge incident at leaf j, placing it to the right of leaf j. This gives a new tree

t′ = f_n(t, j, k).   (2)

Such a transition is illustrated in figure 1. Starting from the tree t in the left diagram, leaf 7 is deleted and the levels adjusted where necessary, to give the center diagram; then leaf 7 is inserted to the right of leaf 4, and levels adjusted, to give the tree t′ in the right diagram. The Moran chain is now defined to be the chain with transition probabilities

p(t, t′) = P(f_n(t, J, K) = t′)

where (J, K) is a uniform random distinct ordered pair from [n]. It is easy to check this defines an aperiodic irreducible chain, which therefore has some limiting stationary distribution. Our choice of T_n as the precise state-space was motivated by Lemma 1 below.
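For concreteness, here is one way the state space and the map f_n might be encoded in code. The representation below (leaf labels in left-to-right order, plus distinct "levels" on the n − 1 gaps between adjacent leaves, recording the height of the split separating them) is our own illustrative choice, not from the paper; any assignment of distinct levels to the gaps determines a unique tree, which is one way to see (1). A minimal Python sketch under that assumption:

```python
import random

# A tree in T_n is encoded as (leaves, gaps):
#   leaves: the n leaf labels in left-to-right order;
#   gaps:   n-1 distinct levels, gaps[i] being the level of the split
#           separating leaves[i] from leaves[i+1] (the highest = root split).

def _renormalize(gaps):
    """Rewrite the gap levels as their ranks 1..len(gaps), preserving order."""
    order = sorted(gaps)
    return [order.index(g) + 1 for g in gaps]

def delete_leaf(tree, k):
    """Remove leaf k; the split separating k's two neighbours is the
    older (higher) of the two gaps adjacent to k."""
    leaves, gaps = tree
    i = leaves.index(k)
    leaves = leaves[:i] + leaves[i+1:]
    if i == 0:
        gaps = gaps[1:]
    elif i == len(gaps):
        gaps = gaps[:-1]
    else:
        gaps = gaps[:i-1] + [max(gaps[i-1], gaps[i])] + gaps[i+1:]
    return leaves, _renormalize(gaps)

def insert_right(tree, j, k):
    """Insert leaf k immediately to the right of leaf j; the new split
    (j, k) is the most recent one, i.e. gets the lowest level."""
    leaves, gaps = tree
    i = leaves.index(j)
    leaves = leaves[:i+1] + [k] + leaves[i+1:]
    gaps = gaps[:i] + [0] + gaps[i:]      # 0 = below all existing levels
    return leaves, _renormalize(gaps)

def moran_step(tree, rng=random):
    """One transition t -> f_n(t, J, K), (J, K) uniform distinct ordered."""
    j, k = rng.sample(tree[0], 2)
    return insert_right(delete_leaf(tree, k), j, k)
```

Deleting a leaf and re-inserting it with a new lowest split is exactly the "levels adjusted" step of figure 1.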

Lemma 1 The stationary distribution of the Moran chain is the uniform distribution on T_n.

Proof. From a tree t there are n(n − 1) equally likely choices of (j, k) which define the possible transitions t → t′. To prove the lemma we need to show the chain is doubly stochastic, i.e. that for each tree t′ there are n(n − 1) such choices which get to t′ from some starting trees t. For a tree t′ (illustrated by the right diagram in figure 1) there is only one leaf k (leaf 7, in figure 1) which might have been inserted last, into a diagram like the tree t″ in the center diagram. The trees t such that deleting leaf k from t gives t″ are exactly the trees obtainable by attaching leaf k to any of the (n − 1)n/2 edges of t″, to the right or the left of the existing edge, giving a total of (n − 1)n/2 × 2 = n(n − 1) choices, as required.

Remark. The final part of the argument says that the general element t ∈ T_n may be constructed from a tree in T_{n−1} by attaching leaf n to one of the (n − 1)n/2 edges, to the right or the left of the existing edge. So #T_n = n(n − 1) #T_{n−1}, establishing (1) by induction.
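Continuing the illustration, a brute-force check of Lemma 1 (and of the encoding sketched above) for n = 3:

```python
from itertools import permutations
from collections import Counter

def all_trees(n):
    """Enumerate T_n as (leaf order) x (gap ranking): n!(n-1)! trees."""
    return [(list(p), list(g))
            for p in permutations(range(1, n + 1))
            for g in permutations(range(1, n))]

n = 3
trees = all_trees(n)
assert len(trees) == 12                       # 3! x 2!, as in (1)

incoming = Counter()
for t in trees:
    for k in range(1, n + 1):
        for j in range(1, n + 1):
            if j != k:
                t2 = insert_right(delete_leaf(t, k), j, k)
                incoming[tuple(map(tuple, t2))] += 1

# Doubly stochastic: every tree is reached by exactly n(n-1) = 6 of the
# equally likely (t, (j, k)) choices.
assert len(incoming) == 12 and set(incoming.values()) == {6}
```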

1.2 Mixing time for the Moran chain

Write (X_n(m), m = 0, 1, 2, . . .) for the Moran chain on T_n. One of the standard ways of studying mixing is via the maximal variation distance

d_n(m) := max_{t ∈ T_n} (1/2) Σ_{t′} |P(X_n(m) = t′ | X_n(0) = t) − π_n(t′)|   (3)

where π_n(t′) = 1/#T_n is the stationary probability. Our result is most easily expressed in terms of the following random variables. Let (ξ_i, 2 ≤ i < ∞) be independent, and let ξ_i have the exponential distribution with mean 1/(i(i−1)); then let

L = Σ_{i=2}^∞ ξ_i.   (4)
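Note that E L = Σ_{i≥2} 1/(i(i−1)) = 1, by telescoping. For intuition about the tail probabilities appearing in Proposition 2 below, here is a small Monte Carlo sketch (the series is truncated; function names are ours):

```python
import random

def sample_L(terms=1000, rng=random):
    """One (truncated) draw of L = sum_{i>=2} xi_i, where xi_i is
    exponential with rate i(i-1); the omitted tail has mean 1/(terms+1)."""
    return sum(rng.expovariate(i * (i - 1)) for i in range(2, terms + 2))

def tail_prob(z, reps=1000, rng=random):
    """Monte Carlo estimate of P(L > z)."""
    return sum(sample_L(rng=rng) > z for _ in range(reps)) / reps

print(tail_prob(1.0))    # an estimate of P(L > 1)
```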

Proposition 2
(a) lim sup_n d_n(zn²) ≤ P(L > z) < 1 for each 0 < z < ∞;
(b) lim inf_n d_n(zn²) ≥ φ(z), for some φ(z) ↑ 1 as z ↓ 0.

Thus the mixing time (as measured by variation distance) for the Moran chain is of order n², but variation distance does not exhibit the cut-off phenomenon of [8] which usually occurs with Markov chains on combinatorial sets. As briefly sketched in the next section, what's really going on is that

d_n(zn²) → d_∞(z)

where d_∞(·) is the maximal variation distance associated with a certain limit continuous-time, continuous-space Markov process. We might call this diffusive behavior, by analogy with the case of simple random walk on the k-dimensional integers modulo n, whose n → ∞ limit behavior is of this form with the limit process being Brownian motion on [0, 1]^k.

1.3 Remarks on a limit process

We now outline why Proposition 2 is not unexpected. In the original Moran model for a population, one can look back from the present to determine the number of steps L_n back until the last common ancestor of the present population. It is standard (see section 2.3) that n⁻²L_n converges in distribution to L. Loosely speaking, this implies that the genealogy after n²L steps cannot depend on the initial genealogy, and Proposition 2(a) is a formalization of that idea. A more elaborate picture is given by the theory surrounding Kingman's coalescent [13, 17], which portrays the rescaled n → ∞ limit of the genealogy of the current size-n population as a genealogy C of an infinite population. Informally, what's really going on is that

d_n(zn²) → d_∞(z), 0 < z < ∞

where d_∞(·) is the maximal variation distance associated with a certain continuous-time Markov process (C_t, 0 ≤ t < ∞) whose state space is a set of possible genealogies for an infinite population. However, defining the process (C_t) precisely and proving mixing time bounds via this weak convergence methodology is technically hard. It turns out to be fairly simple to give a direct analysis of Proposition 2 in the discrete setting, by combining the standard analysis of L_n with a coupling construction, so that is what we shall do.

2 Proof of Proposition 2

2.1 Coupling from the past

The proof is based on a standard elementary idea. Suppose a Markov chain (X(s)) can be represented in the form

X(s + 1) = f(X(s), U_{s+1})

for some function f, where the (U_s) are independent with common distribution μ. Then

X(m) = g_m(X(0), U_1, U_2, . . . , U_m)

where

g_m(x, u_1, . . . , u_m) := f(· · · f(f(x, u_1), u_2) · · · , u_m).

Define d(m) as at (3).

Lemma 3 d(m) ≤ 1 − P(A(m)), where A(m) is the event

g_m(x, U_{−m+1}, U_{−m+2}, . . . , U_0) = g_m(x′, U_{−m+1}, U_{−m+2}, . . . , U_0)  ∀ x, x′

where the (U_{−i}) are independent with distribution μ. This form of coupling has in recent years become known as coupling from the past, named because we are constructing X(0) from X(−m) and the intervening U's, and on the event A(m) the current state X(0) does not depend on the state X(−m). Recent interest has focused on the connection between coupling from the past and perfect sampling: see [15].
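For a tiny state space, Lemma 3 can be checked experimentally by pushing every starting state through the same innovations; a generic sketch (ours, not from the paper):

```python
def grand_coupling(states, f, innovations):
    """Apply x -> f(x, u) to every starting state with the same innovations
    U_{-m+1}, ..., U_0; return the common endpoint if all trajectories have
    coalesced (the event A(m) of Lemma 3), else None."""
    current = set(states)
    for u in innovations:
        current = {f(x, u) for x in current}
    return next(iter(current)) if len(current) == 1 else None
```

For the Moran chain the innovations are i.i.d. uniform distinct ordered pairs (J, K), but since #T_n = n!(n − 1)! this direct approach is hopeless except for very small n; the construction of section 2.2 instead tracks a single forest-valued chain which certifies the coalescence event.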

2.2 The coupling construction

First note the Moran chain fits into the setup above by setting

X_n(s + 1) = f_n(X_n(s), J_s, K_s)   (5)

for f_n at (2) and (J_s, K_s) independent uniform random distinct ordered pairs from [n]. Given a tree t ∈ T_n, say that f is a forest consistent with t if it can be obtained by deleting some (maybe empty) set of edges of t and then taking the spanning forest on the leaves [n]. The left diagram in figure 2 illustrates a forest f consistent with the left tree in figure 1. The transition t → t′ = f_n(t, j, k) on trees extends in the natural way to a transition f → f′ = f_n(f, j, k) on forests. Figure 2 shows a transition consistent with the transition in figure 1.

[Figure 2: left, a forest f consistent with the tree t of figure 1; right, the forest f′ = f_7(f, 4, 7).]

Figure 2. A transition f → f′.

Fix m. Consider the Moran chain (X_n(s), −m ≤ s ≤ 0) defined by (5) with some initial X_n(−m), and consider the forest-valued chain

Y_n(s + 1) = f_n(Y_n(s), J_s, K_s)

where Y_n(−m) is the forest on [n] with no edges. It is easy to check that, for each realization of the joint process (X_n(s), Y_n(s)), we have that Y_n(s) is consistent with X_n(s). But Y_n(s) does not depend on X_n(−m), and if Y_n(s) is a single tree then Y_n(s) = X_n(s). Thus on the event

A_n(m) := {Y_n(0) is a single tree}

we have that X_n(0) does not depend on X_n(−m). So by Lemma 3,

d_n(m) ≤ 1 − P(A_n(m))   (6)

and we want to estimate the right side.

2.3 Analyzing the coupling

For a forest f on [n] write #f for the multi-set of the number of leaves in each tree-component. So #Y_n(−m) = {1, 1, 1, . . . , 1} and A_n(m) is the event that #Y_n(0) = {n}. Now reconsider the original Moran model from the start of section 1.1, and run this model for time s = −m, −m + 1, . . . , 0. Give a different color to each individual in the initial time −m population, and then let children inherit the color of their parent. Write (C_n(s), −m ≤ s ≤ 0) for the multi-set of the number of individuals of each color at time s. It is easy to check these two processes are the same:

Lemma 4 The process (C_n(s), −m ≤ s ≤ 0) has the same distribution as the process (#Y_n(s), −m ≤ s ≤ 0).
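By Lemma 4, P(A_n(m)) can be estimated by simulating colors rather than forests; a sketch (function names are ours):

```python
import random
from collections import Counter

def colour_counts(n, m, rng=random):
    """Multiset of color counts after m Moran steps from n distinct colors;
    by Lemma 4 this is distributed as #Y_n(0) when m steps are run."""
    pop = list(range(n))                  # pop[i] = color of individual i
    for _ in range(m):
        k, j = rng.sample(range(n), 2)    # k is killed, j gives birth
        pop[k] = pop[j]                   # the child inherits j's color
    return sorted(Counter(pop).values())

# Estimate P(A_n(m)) = P(C_n(0) = {n}), i.e. one surviving color:
n, m, reps = 30, 2 * 30 ** 2, 500
print(sum(colour_counts(n, m) == [n] for _ in range(reps)) / reps)
```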

But analyzing C_n(·) is a classical and elementary topic in population genetics (e.g. [6]) which we repeat for completeness. Define (N_n(s), 0 ≤ s ≤ m) as the number of individuals at time −s who have descendants at time 0. Then N_n(·) is the Markov chain on state-space [n] with N_n(0) = n and

P(N_n(s + 1) = j − 1 | N_n(s) = j) = j(j − 1)/(n(n − 1))
P(N_n(s + 1) = j | N_n(s) = j) = 1 − j(j − 1)/(n(n − 1)).

This holds because in the Moran model in reversed time, at each step two different individuals are chosen at random to have the same parent, and

the other parents are distinct. Now the chain N_n(s) can be defined for 0 ≤ s < ∞. Write

L_n = min{s : N_n(s) = 1}.
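The hitting time L_n is equally easy to simulate directly from the down-chain (a sketch; names are ours):

```python
import random

def sample_Ln(n, rng=random):
    """L_n = min{s : N_n(s) = 1} for the down-chain started at N_n(0) = n:
    from state j, move to j - 1 with probability j(j-1)/(n(n-1))."""
    j, s = n, 0
    while j > 1:
        s += 1
        if rng.random() < j * (j - 1) / (n * (n - 1)):
            j -= 1
    return s

# n^{-2} L_n is approximately distributed as L of (4); indeed
# E[L_n] = sum_{i=2}^{n} n(n-1)/(i(i-1)) = (n-1)^2, matching E[L] = 1.
n, reps = 100, 500
print(sum(sample_Ln(n) for _ in range(reps)) / (reps * n ** 2))
```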

The event {L_n ≤ m} is the event that all the individuals in the Moran process at time 0 are descendants of some one individual at time −m, and hence have the same color. So

P(A_n(m)) = P(#Y_n(0) = {n}) = P(C_n(0) = {n}) = P(L_n ≤ m).

So by (6),

d_n(m) ≤ P(L_n > m).

On the other hand we can represent L_n as a sum of independent random variables

L_n = Σ_{i=2}^n ξ_{n,i}

where ξ_{n,i} has the geometric distribution with mean n(n − 1)/(i(i − 1)). It easily follows that

P(L_n > n²z) → P(L > z)

for L defined at (4). This establishes part (a) of Proposition 2.

Part (b) is easy. For t ∈ T_n write r(t) for the number of leaves in the right-hand branch of t from the top-level split (so r(t) = 4 for the tree in the left of figure 1). Write R_n(m) for the process r(X_n(m)), stopped upon reaching value 1 or n − 1. Then R_n(·) is a certain Markov chain which jumps by at most 1 each time, and which is a martingale; since its increments are orthogonal and each is bounded by 1,

E(R_n(m) − R_n(0))² ≤ m.

It follows that, for initial trees t_n satisfying n⁻¹r(t_n) → r(0) ∈ (0, 1),

if m_n = o(n²) then n⁻¹r(X_n(m_n)) → r(0) in probability.

Part (b) follows easily: by Chebyshev's inequality, after zn² steps the ratio n⁻¹R_n has moved by more than ε with probability at most z/ε², while under the uniform distribution π_n the value n⁻¹r is not concentrated near any single value.

3 Final remarks

(a) As with the analogous chains (riffle shuffle, move-to-front, move-to-root) mentioned in the Introduction, it seems plausible that one can do a more refined analysis of the Moran chain which exhibits all the eigenvalues; in another direction, one might be able to handle non-uniform distributions

8 on leaf-pairs. Indeed, Brown [4] section 6.3 observes that the non-uniform Moran chain fits into the general setting of “random walks on left-regular bands” (a type of semigroup), without analyzing this particular chain. (b) One can alternatively define the state space of the Moran chain to be the (smaller) set Te n of n-leaf cladograms. A cladogram is also a binary tree with leaf-set [n], but now we do not distinguish between left and right branches, and we do not count the overall order of splits. The Moran chain on Te n has a certain non-uniform stationary distribution, but the conclusion of Proposition 2 remains true. Cladograms are used in biological systematics [10] to represent evolutionary relationship between . Markov chain Monte Carlo methods to infer the true cladogram from data start with some “base chain” on cladograms, providing some remote motivation for studying their mixing times. A related chain designed to have uniform stationary distribution on Te n is studied in [2]; a coupling argument is used to bound its mixing time as order n3, though we expect the correct mixing time is again order n2. The proof in [2] uses a more intricate coupling argument of the same style as in this paper, but the absence of an analog of Lemma 4 makes its analysis rather harder.

References

[1] D.J. Aldous and P. Diaconis. Shuffling cards and stopping times. Amer. Math. Monthly, 93:333–348, 1986.

[2] D.J. Aldous. Mixing time for a Markov chain on cladograms. Preprint, 1999.

[3] D.J. Aldous and J.A. Fill. Reversible Markov chains and random walks on graphs. Book in preparation, 2001.

[4] K. Brown. Semigroups, rings and Markov chains. To appear in J. Theoretical Probability, 1999.

[5] F.R.K. Chung, R. L. Graham, and S.-T. Yau. On sampling with Markov chains. Random Struct. Alg., 9:55–77, 1996.

[6] P. Clifford and A. Sudbury. Looking backwards in time in the Moran model in population genetics. J. Appl. Probab., 22:437–442, 1985.

[7] P. Diaconis. Group Representations in Probability and Statistics. Institute of Mathematical Statistics, Hayward CA, 1988.

[8] P. Diaconis. The cut-off phenomenon in finite Markov chains. Proc. Nat. Acad. Sci. USA, 93:1659–1664, 1996.

[9] R.P. Dobrow and J.A. Fill. Rates of convergence for the move-to-root Markov chain for binary search trees. Ann. Appl. Probab., 5:20–36, 1995.

[10] N. Eldredge and J. Cracraft. Phylogenetic Patterns and the Evolutionary Process. Columbia University Press, New York, 1980.

[11] W.J. Ewens. Mathematical Population Genetics, volume 9 of Biomathematics. Springer-Verlag, Berlin, 1979.

[12] J.A. Fill. An exact formula for the move-to-front rule for self-organizing lists. J. Theoretical Probab., 9:113–160, 1996.

[13] J.F.C. Kingman. The coalescent. Stochastic Process. Appl., 13:235–248, 1982.

[14] L. Lovász and P. Winkler. Mixing times. In D. Aldous and J. Propp, editors, Microsurveys in Discrete Probability, number 41 in DIMACS Ser. Discrete Math. Theoret. Comp. Sci., pages 85–134, 1998.

[15] J. Propp and D. Wilson. Coupling from the past: a user’s guide. In D. Aldous and J. Propp, editors, Microsurveys in Discrete Probability, number 41 in DIMACS Ser. Discrete Math. Theoret. Comp. Sci., pages 181–192, 1998.

[16] A. J. Sinclair. Algorithms for Random Generation and Counting. Birkhauser, 1993.

[17] S. Tavaré. Line-of-descent and genealogical processes and their applications in population genetics models. Theoret. Population Biol., 26:119–164, 1984.

[18] U. Vazirani. Rapidly mixing Markov chains. In B. Bollobás, editor, Probabilistic Combinatorics and Its Applications, volume 44 of Proc. Symp. Applied Math., pages 99–122. American Math. Soc., 1991.
