The Randomness Recycler: A New Technique for Perfect Sampling

James Allen Fill (Department of Mathematical Sciences, The Johns Hopkins University, jimfi[email protected])
Mark Huber† (Department of Statistics, Stanford University, [email protected])

Abstract

For many probability distributions of interest, it is quite difficult to obtain samples efficiently. Often, Markov chains are employed to obtain approximately random samples from these distributions. The primary drawback to traditional Markov chain methods is that the mixing time of the chain is usually unknown, which makes it impossible to determine how close the output samples are to having the target distribution. Here we present a new protocol, the randomness recycler (RR), that overcomes this difficulty. Unlike classical Markov chain approaches, an RR-based algorithm creates samples drawn exactly from the desired distribution. Other perfect sampling methods such as coupling from the past use existing Markov chains, but RR does not use the traditional Markov chain at all. While by no means universally useful, RR does apply to a wide variety of problems. In restricted instances of certain problems, it gives the first expected linear time algorithms for generating samples. Here we apply RR to self-organizing lists, the Ising model, random independent sets, random colorings, and the random cluster model.

Research supported by NSF grant DMS-9803780 and by the Acheson J. Duncan Fund for the Advancement of Research in Statistics.
†Supported by NSF Postdoctoral Fellowship 99-71064.

1 Introduction

The Markov chain Monte Carlo (MCMC) approach to generating samples has enjoyed enormous success since its introduction, but in certain cases it is possible to do better. The "randomness recycler" technique we introduce here (and whose name is explained in Section 2) works for a variety of problems without employing the traditional Markov chain. Our approach is faster in many cases, generating in particular the first algorithms that have expected running time linear in the size of the problem, under certain restrictions.

In classical MCMC approaches, small random changes are made in the observation until the observation has nearly the stationary distribution of the chain. The Metropolis [13] and heat bath algorithms utilize the idea of reversibility to design chains with a stationary distribution matching the desired distribution. Unfortunately, this standard Markov chain approach does have problems. The samples generated by MCMC will not be drawn exactly from the stationary distribution, but only approximately. Moreover, they will not be close to the stationary distribution until a number of steps larger than the mixing time of the chain have been taken. Often the mixing time is unknown, and so the quality of the sample is suspect.

Recently, Propp and Wilson have shown how to avoid these problems using techniques such as coupling from the past (CFTP) [15]. For some chains, CFTP provides a procedure that allows perfect samples to be drawn from the stationary distribution of the chain, without knowledge of the mixing time. However, CFTP and related approaches have drawbacks of their own. These algorithms are noninterruptible, which means that the user must commit to running such an algorithm for its entire (random) running time even though that time is not known in advance; failure to do so can introduce bias into the sample. Other algorithms, such as FMMR [3], are interruptible (when time is measured in Markov chain steps), but require storage and subsequent rereading of random bits used by the algorithm. The method we will present is both interruptible and "read-once," with no storage of random bits needed.

In addition, algorithms like CFTP and FMMR require an underlying Markov chain, and can never be faster than the mixing time of this underlying chain. Often these chains make changes to parts of the state that have already been suitably randomized. This wasted effort often adds a log factor to the running time of the algorithm.

The randomness recycler (RR) is not like any of these perfect sampling algorithms. In fact, the RR approach abandons the traditional Markov chain entirely. This is what allows the algorithm in several cases to reach an expected running time that is linear, the first for several problems of interest.

The RR technique gives interruptible, read-once perfect samples. In the next section we illustrate the randomness recycler for the problem of finding random independent sets of a graph. After this example we present in Section 3 the general randomness recycler procedure and a (partial) proof of correctness. In Section 4 we present other applications, and in Section 5 we review the results of applying our new approach to several different problems.

2 Weighted Independent Sets

We begin by showing how the randomness recycler technique applies to the problem of generating a random independent set of a graph. This will illustrate the key features of RR and lay the groundwork for the more general procedure described in the next section.

Recall that an independent set of a graph is a subset of vertices no two of which share an edge. We will represent an independent set as a coloring of the vertices from {0, 1}, denoted generically by x. Set x(v) = 1 if v is in the independent set, and x(v) = 0 if v is not. This implies that Σ_v x(v) is the size of the independent set. We wish to sample from the distribution

    π(x) = λ^(Σ_v x(v)) / Z,

where λ (called the fugacity) is a parameter of the problem, and Z is the normalizing constant needed to make π a probability distribution.

This distribution is known as the hard core gas model in statistical physics, and it also has applications in stochastic loss networks [11]. When λ is large the sample tends to be a large independent set, and if λ > 25/Δ, where Δ is the maximum degree of the graph, it is known that generating samples from this distribution cannot be done in polynomial time unless NP = RP [1].

We will show that for λ < 4/(3Δ − 4) the randomness recycler approach gives an algorithm with expected running time linear in the size of the graph, the first such result for this problem.

The RR approach is to start not with the entire graph, but rather with a small graph where we can easily find an independent set from this distribution. For example, if a graph has only a single vertex, finding an independent set is easy. Starting from a single vertex, we attempt to add vertices to the graph, building up until we are back at our original problem. Sometimes we fail in our attempt to build up the graph, and indeed will then also need to remove vertices that we had previously added. The set Vt will comprise those vertices we have built up by the end of time step t. After step t, the vector xt will hold an independent set which has the correct distribution over (the subgraph induced by) the vertices in Vt.

Randomness Recycler for Independent Sets

 1  Set V0 ← ∅, x0 ← 0, t ← 0
 2  Repeat
 3      Set xt+1 ← xt
 4      Choose any v ∈ V \ Vt
 5      Set Vt+1 ← Vt ∪ {v}
 6      Draw U uniformly at random from [0, 1]
 7      If U ≤ 1/(1 + λ)
 8          Let xt+1(v) ← 0
 9      Else
10          Let xt+1(v) ← 1
11      If xt+1(v) = 1 and a neighbor w of v has xt+1(w) = 1
12          Let w be the lowest-numbered such neighbor
13          Set xt+1(w) ← 0, xt+1(v) ← 0
14          Remove from Vt+1 the vertices v and w, all
            neighbors of w, and all neighbors of v with
            numbers less than that of w
15      Set t ← t + 1
16  Until Vt = V

(In advance of running the algorithm, choose and fix a numbering of the vertices.) The algorithm proceeds inductively as follows. At the outset of step t + 1, we begin with an independent set xt of Vt chosen with the correct probability. Then we choose a vertex v not in Vt to attempt to add. This vertex may be chosen in any fashion desired (randomly, or according to some fixed order, but not depending on the independent set xt). Because the desired probability of choosing an independent set x is proportional to λ^(Σ_v x(v)), putting xt+1(v) = 1 has λ times the weight of putting xt+1(v) = 0. Therefore we select xt+1(v) = 1 with probability λ/(1 + λ) and xt+1(v) = 0 with probability 1/(1 + λ) (these are the heat bath probabilities).

Unfortunately, the vector xt+1 resulting from this selection may fail to correspond to an independent set. At line 11 of the pseudocode, we check whether some neighbor of v was already colored 1 (in the independent set). Note that we cannot simply remove v: prior to the step, we knew that xt was an independent set of Vt, but if we observe that xt(w) = 1 for some lowest-numbered neighbor w of v, then xt is an independent set on Vt conditioned on this knowledge.

Our solution is this: in line 14 we "undo" the knowledge gained by removing from Vt+1 the vertices v and w, all the neighbors of w, and all the neighbors of v with number less than that of w. On the remaining vertices of Vt+1, xt+1 will continue to be an independent set from the correct distribution. We will say that an RR step of this type preserves the correct distribution.

Note that although Vt+1 is made smaller than Vt in the case of a conflict, we are able to salvage most of the vertices in Vt.
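The pseudocode above translates almost directly into executable form. The sketch below is our own illustration, not code from the paper (the function and variable names are ours); it implements the heat bath proposal and the recycling rejection step for the hard core model with fugacity λ.

```python
import random

def rr_independent_set(adj, lam, rng=random.random):
    """Randomness recycler for the hard core model.

    adj: adjacency lists; vertices are numbered 0..n-1, and that fixed
         numbering is the one used by the rejection step.
    lam: the fugacity lambda.
    Returns x: dict vertex -> 0/1, an independent set drawn exactly from
    pi(x) proportional to lam ** (number of vertices colored 1).
    """
    n = len(adj)
    built = set()           # V_t, the vertices built up so far
    x = {}                  # the coloring on V_t
    while len(built) < n:
        # choose any vertex not yet built up (here: lowest-numbered)
        v = min(u for u in range(n) if u not in built)
        built.add(v)
        # heat bath proposal: color 1 w.p. lam/(1+lam), else color 0
        x[v] = 1 if rng() >= 1.0 / (1.0 + lam) else 0
        if x[v] == 1:
            ones = [w for w in adj[v] if x.get(w) == 1]
            if ones:
                # conflict: recycle by removing v, w, all neighbors of w,
                # and neighbors of v numbered below w (line 14 above)
                w = min(ones)
                doomed = {v, w} | set(adj[w]) | {u for u in adj[v] if u < w}
                for u in doomed:
                    built.discard(u)
                    x.pop(u, None)
    return x
```

The loop maintains the invariant that x restricted to the built-up set is an independent set, so the returned coloring is always feasible; the recycling step is what makes it exactly distributed.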

In other words, we "recycle" the randomness built up in all of the vertices except v and w and some neighbors. This is where our approach gets its name, and "recycling" is the key new feature that enables us to construct similar practicable algorithms for a wide variety of problems.

We repeat until Vt = V. Because each step preserves the correct distribution, we know that xt will have the correct distribution at the end. This is proved formally in the next section; here we concentrate on bounding the running time of our procedure.

Theorem 1  If λ < 1/(2Δ − 1), then the expected running time of the above randomness recycling procedure for random independent sets is O(n).

A more careful statement of Theorem 1 is given following the proof.

Proof  We will show that for λ this small, on average |Vt| increases at each step. If U ≤ 1/(1 + λ), then the size of Vt goes up by 1, but if U > 1/(1 + λ), then the size of Vt may decrease by at most 2Δ − 1 [removing v (not previously included), w, and some neighbors]. Hence

    E[|Vt+1| − |Vt| given Vt, xt] ≥ (1/(1 + λ)) · 1 − (λ/(1 + λ)) (2Δ − 1)
                                 = (1/(1 + λ)) [1 − λ(2Δ − 1)],

which is positive precisely when λ < 1/(2Δ − 1). Given an increase of |Vt| on average at each step, standard martingale stopping theorems (see, e.g., [14]) show that after O(n) expected time the value of |Vt| will be n, at which point Vt = V and the algorithm terminates.  □

More carefully, if

    φ := (1/(1 + λ)) [1 − λ(2Δ − 1)] ∈ (0, 1),

then the expected value of the running time T (measured by number of iterations of the Repeat loop) satisfies ET ≤ n/φ. Furthermore, a simple argument shows that the distribution of T has at worst geometrically thick tails:

    P(T ≥ 2^m n/φ) ≤ 2^(−m),   m = 1, 2, . . . .

Several tricks may be used either to improve our method or to improve our bounds on its performance. The first two we present work by altering the algorithm, and the third gives a better analysis.

First we concentrate on making sure that as few vertices as possible are removed in the rejection step. Note that we may assume that the graph is connected, since otherwise we simply work on each connected component separately. Therefore V \ V0 is connected, and by being slightly careful in how we choose v ∈ V \ Vt, we can ensure that V \ Vt remains connected at each step. In every step (except when |V \ Vt| = 1), the vertex v is then adjacent not to Δ vertices in Vt but only to at most Δ − 1, so fewer vertices are removed during rejection. Since the vertices removed from Vt in case of rejection are connected to V \ Vt, V \ Vt+1 will also be connected whether we accept or reject, and no more than an extra constant amount of work is required at each step.

The second alteration concerns how we look for the neighbor of v that is colored 1 in case of rejection. Instead of starting at the lowest-numbered neighbor and working our way up, we start at a random neighbor and continue looking in cyclical order until we find our w colored 1; then the vertices that we remove are w and its neighbors, and v and those of its neighbors encountered in the search prior to finding w. On average (and together with the first trick), we need only look at Δ/2 vertices in order to find w.

Finally, in our analysis we kept track of a potential Φ(Vt, xt) = |Vt| and showed that Φ increases on average. When we accept we sometimes add a vertex colored 1 to our set Vt; but when we reject, precisely one vertex colored 1 (namely, w) is removed. This suggests that we modify Φ so that the acceptance and rejection phases both lead (in the worst case) to the same expected change in Φ. We will consider

    Φ(Vt, xt) = |Vt| − α Σ_v xt(v)

and seek a suitable value of α. The expected change in Φ if no neighbor of v is colored 1 is 1 − α[λ/(1 + λ)]. If some neighbor is colored 1, then the expected change is at least 1 · [1/(1 + λ)] + [λ/(1 + λ)][α − (3Δ − 2)/2]. [The term (3Δ − 2)/2 is an upper bound on the expected decrease in |Vt|, since (see above) on average we lose at most Δ/2 neighbors of v and Δ − 1 neighbors of w.] These two expressions may be made equal by setting α = 3Δ/4, and then the expected change in Φ will be positive when λ < 4/(3Δ − 4).

Note that Φ at time 0 equals 0 and can never be more than n, and we have shown that Φ is expected to increase by a fixed positive amount at each step (when we formulate Φ carefully as in the paragraph following the proof of Theorem 1). This fact together with standard martingale stopping theorems can then be used to show that the expected time needed for |Vt| to equal n is at most linear in n.
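The algebra behind the improved potential can be checked numerically. The snippet below is our own illustration (not from the paper): it evaluates the two expected one-step changes of Φ = |Vt| − α Σ xt(v) under α = 3Δ/4 and confirms that they coincide and change sign exactly at λ = 4/(3Δ − 4).

```python
def drift_terms(lam, delta):
    """Expected one-step change of Phi = |V_t| - alpha * (# vertices
    colored 1), with alpha = 3*delta/4, in the two cases of the analysis:
    no neighbor of v colored 1, and some neighbor colored 1."""
    alpha = 3.0 * delta / 4.0
    e_free = 1.0 - alpha * lam / (1.0 + lam)
    e_conf = (1.0 / (1.0 + lam)
              + (lam / (1.0 + lam)) * (alpha - (3.0 * delta - 2.0) / 2.0))
    return e_free, e_conf
```

Because α was chosen to equalize the two cases, the two returned values agree identically, and the common value is positive precisely below the threshold.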

2.1 Markov chain approaches

Several Markov chains for this problem exist [12] [2], together with techniques for using CFTP to obtain perfect samples [6] [7]. These Markov chains are known to mix in time O(n ln n), and the corresponding perfect sampling algorithms are known to run in time O(n ln n), when λ < 2/(Δ − 2), which is a larger range of λ than our method gives. However, when λ is small enough, the O(n) bound for our RR algorithm is smaller. It is hoped that with further refinement of the rejection step, the range of λ may be increased to where it matches the Markov chain analysis.

3 The Randomness Recycler

We now present a more general outline of the randomness recycler technique. Many state spaces of interest are of the form C^V, the set of (proper or improper) colorings of a graph. Our goal is to sample from π in expected time linear in |V|. We have already seen how the independent sets of a graph may be encoded by coloring a vertex 1 if it is in the independent set and 0 otherwise. For another example, the set of permutations of the n elements {1, . . . , n} is a subset of {1, . . . , n}^V. Of course, the size of the state space may be as large as |C|^|V|, and this is in part what makes generating samples from these distributions difficult.

The Randomness Recycler (Outline)

Set V0 ← ∅, X0 ← suitable x0, t ← 0
Repeat
    Set Xt+1 ← Xt
    Choose v ∈ V \ Vt
    Randomly choose color c for v
    Compute probability of accepting color c for v
    If we accept
        Set Xt+1(v) ← c
        Set Vt+1 ← Vt ∪ {v}
    Else
        Set Vt+1 and Xt+1 restricted to V \ Vt+1 in a
        way that "undoes" the effect of rejection
    Set t ← t + 1
Until Vt = V

In an RR algorithm, a sample (i.e., one draw from π) is built up one vertex of V at a time until we include all of the vertices. Let Vt be the subset of vertices on which we have already built up a sample at time t. On the vertices in V \ Vt, the sample is fixed at some value, whereas on Vt the sample is random, and drawn exactly from the desired distribution. Vt starts out empty, and at each step of the algorithm we attempt to add a vertex to Vt. Sometimes this is possible, and sometimes it is not. We continue in this fashion until Vt = V, at which point we have a sample drawn exactly from the desired distribution. Let Xt denote the coloring of the graph V at time t.

The way in which we randomly choose c, compute the acceptance probability, and set Vt+1 and Xt+1 on V \ Vt+1 in case of rejection will all depend on the target distribution π. What differentiates this algorithm from an elementary stepwise rejection approach is our rejection step. Rather than starting over when rejection is faced, we keep as much of Vt as possible, "recycling" the coloring on Vt+1.

At each time step t we keep track of the vertex set Vt together with the colors that are fixed on V \ Vt. The state Xt is random over Vt, while on V \ Vt it is deterministic. Let X*t = (Vt, Xt restricted to V \ Vt), and for any possible value x* = (S, x restricted to V \ S) of X*t, let π_{x*} be π conditioned on the colors of V \ S being as specified by x*.

To achieve both the desired distribution and interruptibility, we want Xt to be random over Vt independent of the history X*t' for t' < t. In other words we want the identity

    P(Xt = xt given X*0 = x*0, . . . , X*t = x*t) = π_{x*t}(xt)    (1)

to hold. Indeed, if it does, then letting T denote the first time that VT = V, it follows easily that

    P(XT = x given T = t) = π(x).

Thus if (1) is satisfied for all t, then at termination time T the RR algorithm returns a sample XT that is distributed according to the desired distribution π, and we have the interruptibility property that T and XT are independent random variables.

Since V0 is empty, it is easy to begin with X0 from π_{x*0}. For notational convenience, let Ht := (X*0 = x*0, X*1 = x*1, . . . , X*t = x*t). We will say that step t + 1 preserves the correct distribution if

    P(Xt = xt given Ht) = π_{x*t}(xt)
        implies
    P(Xt+1 = xt+1 given Ht+1) = π_{x*t+1}(xt+1).

This requirement that RR preserve the correct distribution is somewhat analogous to the design requirement that a Markov chain be reversible. It gives us a straightforward approach to designing an RR.

Just as the heat bath approach gives a means for designing Markov chains that are reversible, it also gives us a method for designing RR algorithms that preserve the correct distribution. For a specified vertex v ∈ V and coloring x, let πv(·; x) denote the conditional probability distribution of X(v) given that X agrees with x off of v, when X has the stationary distribution π. From current state x, the heat bath (or Gibbs sampler) Markov chain approach is to choose v uniformly at random and then choose a new color for v distributed according to πv(·; x).
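The outline can be phrased as a small generic driver whose problem-specific pieces are the proposal, the acceptance probability, and the "undo" rule. The sketch below is ours, not from the paper; the hook names are our own invention.

```python
import random

def randomness_recycler(vertices, propose, accept_prob, undo,
                        rng=random.random):
    """Generic driver following the RR outline.

    propose(v, x)        -> candidate color c for vertex v
    accept_prob(v, c, x) -> probability of accepting c for v
    undo(v, c, x, built) -> vertices to drop from the built-up set on
                            rejection (the problem-specific "undo" rule)
    Returns a coloring x of all vertices; the output has the target
    distribution provided the hooks preserve the correct distribution.
    """
    built, x = set(), {}
    while len(built) < len(vertices):
        v = next(u for u in vertices if u not in built)
        c = propose(v, x)
        if rng() < accept_prob(v, c, x):
            x[v] = c            # accept: v joins the built-up set
            built.add(v)
        else:
            for u in undo(v, c, x, built):
                built.discard(u)   # reject: remove all trace of the
                x.pop(u, None)     # knowledge gained at this step
    return x
```

Instantiating the three hooks with the heat bath quantities of Theorem 2 recovers the heat bath RR; the independent-set algorithm of Section 2 is one such instantiation.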

In heat bath RR, the vertex v is chosen any way the user desires from V \ Vt, and then a new color is picked according to πv(·; x). However, this color is not always accepted. We compute the acceptance probability as follows, with the goal being to preserve the correct distribution. According to Theorem 2 below, this goal is indeed met.

Given values xt, x*t, xt+1, x*t+1 that correspond to a possible acceptance step in which vertex v is added to the growing vertex set, define γ(xt, x*t, xt+1, x*t+1) to be the ratio

    γ(xt, x*t, xt+1, x*t+1) := π_{x*t+1}(xt+1) / [πv(xt+1(v); xt) π_{x*t}(xt)].

Also define

    M(x*t, x*t+1) := max over xt, xt+1 of γ(xt, x*t, xt+1, x*t+1).

Then the probability that we accept a possible transition from (xt, x*t) to (xt+1, x*t+1) is taken to be γ(xt, x*t, xt+1, x*t+1) / M(x*t, x*t+1).

We do not have to use the heat bath probabilities. It is also valid to use the same acceptance probability with the distributions πv(·; x) replaced by arbitrary distributions pv(·; x), when the distribution pv(·; x) is used to color a selected v at a configuration x.

While these acceptance probabilities may appear daunting, for many problems they simplify considerably. For instance, in the independent set case, suppose first that v has no neighbor colored 1. Then the heat bath probabilities are 1/(1 + λ) for color 0 and λ/(1 + λ) for color 1, and the acceptance probability in this first case will always be 1. If instead some neighbor of v is colored 1, then heat bath assigns probability 1 to the color 0; the acceptance probability, however, works out to 1/(1 + λ). Careful examination of the independent set algorithm in Section 2 shows that this is exactly how the color for v is chosen, with the same acceptance probabilities.

To show that the heat bath randomness recycler approach actually works (in general), we need to show that every step preserves the correct distribution. We will first consider acceptance steps, for which the following lemma gives a sufficient condition.

Lemma 1  Given possible values x*t, xt+1, and x*t+1 of X*t, Xt+1, and X*t+1 corresponding to an acceptance step, suppose that only one value xt of Xt has positive probability. If the bivariate process (Xt, X*t), t ≥ 0, evolves Markovianly, and if for all such x*t, xt+1, and x*t+1 and the single xt they determine we have

    P(Xt+1 = xt+1, X*t+1 = x*t+1 given Xt = xt, X*t = x*t) π_{x*t}(xt) = π_{x*t+1}(xt+1) C,

where C does not depend on xt or xt+1, then step t + 1 preserves the correct distribution.

Proof  Let C1 := 1/P(X*t+1 = x*t+1 given Ht), and suppose that P(Xt = xt given Ht) = π_{x*t}(xt). Let E be the event that Xt+1 = xt+1 and X*t+1 = x*t+1. Then

    P(Xt+1 = xt+1 given Ht+1)
        = C1 P(Xt+1 = xt+1, X*t+1 = x*t+1 given Ht)
        = C1 P(E ∩ {Xt = xt} given Ht)
        = C1 P(Xt = xt given Ht) P(E given Ht ∩ {Xt = xt})
        = C1 π_{x*t}(xt) P(E given Xt = xt, X*t = x*t)
        = C1 C π_{x*t+1}(xt+1),

where the last step is exactly our assumption. Note that neither C1 nor C depends on xt+1. Hence, summing over xt+1,

    1 = Σ over xt+1 of C1 C π_{x*t+1}(xt+1) = C1 C.

This completes the proof.  □

Theorem 2  The heat bath RR and arbitrary RR acceptance steps preserve the correct distribution.

Proof  The acceptance probabilities were chosen precisely to match the requirements of Lemma 1. For instance, with heat bath RR, the left side of the equation in Lemma 1 equals

    πv(xt+1(v); xt) · [π_{x*t+1}(xt+1) / (πv(xt+1(v); xt) π_{x*t}(xt) M(x*t, x*t+1))] · π_{x*t}(xt),

which reduces to the right side of the equation with C = 1/M(x*t, x*t+1). The calculation for arbitrary RR is entirely similar.  □

Now we turn our attention to rejection steps. In designing an RR algorithm, it is our experience that proper handling of rejection steps, to ensure preservation of the correct distribution, is more difficult and problem-specific to arrange than is proper handling of acceptance steps. But here are some broad guiding comments.

Determination of the acceptance probability at step t + 1 will reveal knowledge about the colors of some subset, call it Dt, of Vt. If we reject, we then set Vt+1 to be Vt \ Dt. This ensures that when we reject, we do not bias the sample. That is, by removing Dt from Vt, we remove all traces of the knowledge gained, and as a result the remaining sample is drawn exactly from π_{x*t+1}.

In the case of the independent sets, the set Dt consists of precisely those vertices prescribed to be removed by the algorithm: w and all its neighbors, and neighbors of v with numbers lower than that of w. [Indeed, all of these vertices are colored 0 at time t, except for vertex w, which is colored 1.]
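For the hard core model, the ratio γ and its maximum M collapse to the two cases just discussed: acceptance probability 1 when no neighbor of v is colored 1, and 1/(1 + λ) otherwise. The sketch below is our own rendering of that simplification (names are ours, not the paper's).

```python
def heat_bath_proposal(v, x, adj, lam, rng):
    """Heat bath color for v in the hard core model: if some built-up
    neighbor is colored 1, the conditional law of x(v) puts probability
    1 on color 0; otherwise color 1 has probability lam/(1+lam)."""
    if any(x.get(w) == 1 for w in adj[v]):
        return 0
    return 1 if rng() < lam / (1.0 + lam) else 0

def acceptance_probability(v, c, x, adj, lam):
    """gamma/M for the hard core model. The ratio works out to 1 when no
    neighbor of v is colored 1, and to 1/(1+lam) when some neighbor is
    (the proposed color c does not enter in this special case)."""
    if any(x.get(w) == 1 for w in adj[v]):
        return 1.0 / (1.0 + lam)
    return 1.0
```

Plugging these two functions into a generic RR driver reproduces the behavior of the independent-set algorithm of Section 2: proposing color 0 and then rejecting with probability λ/(1 + λ) is the same coin flip as lines 7-11 of the pseudocode.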

are colored 0 at time t, except for vertex w, which is col- The differs from the Ising model in that ored 1.] It is not hard to check rigorously in this case that more than two colors are used, but the energy depends (in rejection steps also preserve the correct distribution, but we a natural way) only on whether edges are colored concor- omit the details. dantly or discordantly, and the running time Theorem 3 re- mains valid verbatim. 4 Applications The Random Cluster Model The random cluster model This section applies the randomness recycler approach is an extension of the Potts model to noninteger numbers of to several different problems of interest. For some of these colors [4]; this is discussed further below. Unlike our pre- models we have theoretical bounds on the running time, vious examples, which colored vertices, the random cluster while for others we have only experimental results. model colors edges of a given graph G = (V, E) with col- ors from 0, 1 . If A is the set of edges colored 1, then the { } The Ising and Potts models In the Ising model, vertices distribution is in a graph (V, E) are colored from the set 1, 1 . The { } A E A c(A) distribution from which we wish to sample is defined by (A) := p| |(1 p)| \ |q /Zp,q, A E, exp JH(x) where p [0, 1]; q > 0 is not necessarily an integer, and (x) := { }, ∈ Z we shall assume q > 1; c(A) is the number of connected components in the graph (V, A); and Zp,q is a normalizing where constant. H(x) := x(v )x(v ) 1 2 The RR approach is as follows. We represent a set A v ,v E { 1X2}∈ E by a binary vector x, by setting x(e) = 1 for e A, is known as the energy of the coloring, is (proportional to) and x(e) = 0 otherwise. At each step, we keep tra∈ck of a postive parameter known as inverse temperature, and J such a vector xt and a set Et of edges, namely, the edges is 1 in the ferromagnetic model and 1 in the antiferro- on which xt is random; all other edges will be colored 0. magnetic model. 
Generating approximate samples may be We choose an oriented edge e = (v, w) E E , until ∈ \ t done in (nonlinear) polynomial time in the ferromagnetic such an edge e no longer exists. We set xt+1(e) = 1 with case using Markov chain techniques of Jerrum and Sin- probability p, and x (e) = 0 with probability 1 p. If v t+1 clair [10] [16]. and w are already connected in xt [i.e., in the graph (V, At) The RR approach has provably linear expected running where At = e0 : xt(e0) = 1 Et], then we accept time for both the ferromagnetic and antiferromagnetic mod- the edge and s{et E = E } e. If v and w are not t+1 t ∪ { } els when is small (i.e., the temperature is high). The set already connected, then we always accept xt+1(e) = 0, but Dt to be removed from Vt in case of rejection is just the set we accept xt+1(e) = 1 only with probability 1/q (since by of neighbors of the vertex v that we tried to add. Omitting adding this edge we reduce by 1 the number of connected details and proofs, we simply state the running time bound components). in the following theorem. When we reject, we know that v and w lie in separate components in (V, A ). To counteract this knowledge, to Theorem 3 Let be the maximum degree of the graph. If t form Et+1 we remove from Et all the edges in the com- 1/ ponent of (V, At) that contains w, together with all edges 1 e < 1 + , of Et that lead out of this component (and which therefore do not belong to At). then the expected running time of the heat bath RR proce- We could cease our handling of a rejection step at this dure for the Ising model is O(n). point and prove that (a) the algorithm works correctly and (b) Theorem 4 below holds (and the proof simplifies some- Comments like those following the proof of Theorem 1 what) with the bound on p decreased to 1 apply here, where now the expected increase 1+ [1 (2 1)] in Vt becomes ( + 1)e . p < 1/( (1/q)). 
Altho| ug|h not needed for the theorem, in practice it helps to introduce a third color 0 to supplement 1, 1 . Notice However, we shall omit the formal proof of correctness and that no edge with an endpoint colored 0 co{ntribu}tes to H. instead discuss a small (provably valid) trick which gains us At the completion of step t, every vertex in V Vt which is some efficiency. surrounded entirely by vertices in V V may\be recolored Suppose that there are M vertices in the removed com- \ t 0 since this action does not affect the vertices in Vt at all. ponent. Consider the (connected!) graph consisting of the

6
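One basic RR step for the random cluster model (without the "add a tree" refinement described next) needs only a connectivity query. The sketch below is our own illustration, not the paper's implementation; for clarity it rebuilds a union-find structure on every call, which is far from efficient.

```python
import random

def rc_step(edges_left, E_t, x, p, q, verts, rng=random.random):
    """One RR step for the random cluster model (no tree trick).

    edges_left: edges of G not yet in E_t; E_t: the processed edges;
    x: dict edge -> 0/1 on E_t (edges outside E_t are fixed at color 0).
    All three containers are mutated in place.
    """
    e = next(iter(edges_left))
    v, w = e
    # components of (V, A_t), where A_t = edges of E_t colored 1
    comp = {u: u for u in verts}
    def find(u):
        while comp[u] != u:
            comp[u] = comp[comp[u]]
            u = comp[u]
        return u
    for (a, b) in E_t:
        if x[(a, b)] == 1:
            comp[find(a)] = find(b)
    connected = find(v) == find(w)

    color = 1 if rng() < p else 0
    if connected or color == 0 or rng() < 1.0 / q:
        # accept: an edge joining two components is kept only w.p. 1/q
        edges_left.discard(e)
        E_t.add(e)
        x[e] = color
    else:
        # reject: drop from E_t the component of w together with every
        # E_t-edge touching it, removing all trace of the knowledge gained
        cw = find(w)
        doomed = {f for f in E_t if find(f[0]) == cw or find(f[1]) == cw}
        for f in doomed:
            E_t.discard(f)
            x.pop(f, None)
            edges_left.add(f)   # removed edges return to E \ E_t
```

Repeatedly calling this step until edges_left is empty plays the role of the Repeat loop; the rejection branch is the "counteract the knowledge" rule described above.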

Choose (in any fashion) a spanning tree T of this graph; T will comprise M + 1 vertices and therefore M edges. Add back in these M edges to get Et+1. Sample from the random cluster model on T, and add back in the edges thereby colored 1 to get At+1.

The key observation here is that it is elementary to sample from the random cluster model when the graph is a tree. Indeed, then each edge independently is colored 1 with probability α/(1 − p + α) and 0 with probability (1 − p)/(1 − p + α), where

    α := p/q.

The random cluster model is an extension of the ferromagnetic Ising and Potts models. When q > 1 is an integer and p = 1 − exp{−β}, then samples from the random cluster model may be used to generate samples from the ferromagnetic Potts model with q colors by independently taking each connected component of (V, A), uniformly choosing one of the q colors, and assigning to every vertex in the component that color. For certain instances of the random cluster model, the heat bath Markov chain approach is believed from experimental evidence to be rapidly mixing [15], but no theoretical rapid mixing results in the positive direction are known for any nontrivial instances of the problem. For some instances, the Markov chain approach is known not to be rapidly mixing [5]. For the RR approach, we know that when p is small (corresponding to small β), the approach takes an expected number of steps which is linear in the number of edges:

Theorem 4  Suppose that

    p < [ (Δ − 1/q) − sqrt{ (Δ − 1/q)² − 4 (1 − 1/q)(Δ − 1) } ] / [ 2 (1 − 1/q)(Δ − 1) ].

Then the expected number of steps required by the RR algorithm is O(|E|).

Proof  We use a potential function that rewards us for adding edges and penalizes us for connecting components. Let

    Φ(Et, At) := |Et| + γ c(At),

where γ will be determined later. When the edge {v, w} we attempt to add to Et is between two vertices already connected in At, then Φ always goes up by 1, making this case uninteresting. It is when {v, w} would connect two previously unconnected components of At that the calculation becomes interesting.

If the edge is chosen to be excluded from At+1, then Φ increases by 1. If the edge is proposed to be included in At+1, then Φ changes by 1 − γ if we accept. If we reject, we remove from At (and also from Et) a component of size M and (from Et) all of its adjacent edges. Not counting the edge {v, w}, and making sure that we do not double-count, this totals at most M(Δ − 1) edges removed from Et. However, we add exactly M − 1 new components to At by removing these edges. When we add the tree T back in, this produces M new edges for Et+1, but for each such edge there is an α/(1 − p + α) chance of including the edge in At+1 and thereby reducing the number of components by 1. Therefore, when we attempt to add {v, w} to At+1 but reject instead, the expected contribution to the change in Φ is at least

    −M(Δ − 1) + M + γ(M − 1) − γM α/(1 − p + α).

Now M may be very large (nearly as large as n), so we choose γ in such a way that the coefficient of M in this expression vanishes. That is, we set

    γ := (Δ − 2)(1 − p + α)/(1 − p),

and so the contribution in this case is bounded below by −γ.

We try to put the edge in with probability p and to leave it out with probability 1 − p. We accept an inclusion with probability 1/q. Putting everything together, we find that the expected change in Φ at any time step when v and w are not already connected in At is at least

    (1 − p) · 1 + p (1/q)(1 − γ) + p (1 − 1/q)(−γ),

which is positive exactly when p satisfies the bound in the statement of Theorem 4. Comments analogous to those following the proof of Theorem 1 again apply.  □

For example, if Δ = 4 (as on a 2-dimensional rectangular grid) and q = 2 (corresponding to the Ising model), then our restriction is that p < 1/3; this improves on the restriction p < 1/(Δ − (1/q)) = 2/7 obtained when the "add a tree" trick is not employed.

In this case, the Markov chain approach does not have theoretical guarantees on the running time for any nontrivial value of p. While coupling from the past may also be used to generate perfect samples, there is no a priori bound on its running time. As with CFTP, we may still use the RR approach for values of p for which no theoretical bound exists; we simply do not know beforehand how long the algorithm will take. Unlike CFTP, the RR approach is interruptible, so we may abort the procedure if it needs too many steps, without introducing bias into the sample.
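The "add a tree" refinement rests on the observation that on a tree every edge removal splits a component, so c(A) = (number of vertices) − |A| and the edges decouple. A minimal sketch of tree sampling (ours, not from the paper), with α = p/q:

```python
import random

def rc_sample_tree(tree_edges, p, q, rng=random.random):
    """Sample an edge coloring of a tree from the random cluster model.

    On a tree, c(A) = (#vertices) - |A|, so the weight factorizes over
    edges and each edge is independently colored 1 with probability
    alpha/(1 - p + alpha), where alpha = p/q.
    """
    alpha = p / q
    keep = alpha / (1.0 - p + alpha)
    return {e: (1 if rng() < keep else 0) for e in tree_edges}
```

Note that at q = 1 the keep probability reduces to p, as it should: with q = 1 the component term disappears and each edge is an independent p-coin.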

Proper colorings of a graph   Finding the number of proper colorings of a graph is a #P-complete problem [9]. Recall that a proper coloring of a graph assigns each vertex a color such that no edge has both endpoints colored the same color. The ability to sample from the set of proper colorings leads to an approximation algorithm for counting the number of such colorings.

Markov chain approaches require that k, the number of colors, be at least (11/6)Δ (where Δ is again the maximum degree of the graph) [19] in order to guarantee rapid mixing for the chain. Perfect sampling using bounding chains [7, 6] is only guaranteed to run in polynomial time when the number of colors is Ω(Δ²). Unfortunately, the straightforward RR approach does not match these bounds. Somewhat roughly stated:

Theorem 5  The heat bath RR approach to generating perfect colorings requires only a linear expected number of steps when k is Ω(Δ⁴).

As with the bounding chain procedure, however, this algorithm may be run even when k is much smaller; we simply have no reasonable a priori bound on the running time in such cases.

The Move Ahead 1 chain   Finally, we present a problem where an RR-based algorithm seems experimentally to run fast although we cannot give any theoretical bounds. In the list update problem, a set of items is kept in a list. To access an item, a user starts at the beginning of the list and steps through the items until the desired item is located. The located item may be replaced in the list anywhere between its current position and the front of the list, at fixed cost. The goal is to use a replacement strategy that keeps small the access times (i.e., item depths in the list) needed for items.

Call the strategy which moves the accessed item to the front of the list the Move to Front (MTF) rule. A worst-case analysis shows that the MTF rule yields a 2-approximation for the optimal total access time for any sequence of item requests [18]. Alternatively, it is useful to employ probabilistic models to describe how list items are chosen to be accessed. Commonly, such an access model will induce a Markov chain model on the evolution of the order of the list. Characteristics such as the limiting distribution as t → ∞ of At, where At is the access time for the item accessed at time t, can then be estimated by drawing from the stationary distribution of the chain.

To be specific, label the items with identification numbers 1, . . . , n; suppose that at each time step, independently of previous time steps, any particular item i is accessed with probability p_i > 0 (independently of the order of the list); and suppose that after each selection, the accessed item is moved forward one rank in the list, i.e., is transposed with its predecessor in the list. (If the accessed item is already at the front of the list, the order of the list is left unchanged.) The self-organization rule we have described is called the Move Ahead 1 (MA1) rule. The limiting expected access time for MA1 is known to be, for any access probability vector p, no more than that for MTF [17]. Further Monte Carlo study of the limiting access time distribution is complicated by the fact that sampling from the limiting list-order distribution (for which a formula is known, but only up to a normalizing constant) seems to be quite difficult in general.

Coupling from the past approaches to sampling from this limiting distribution exist [8], but experimental evidence suggests that use of RR gives a faster algorithm. Suppose that p_i ∝ r^i for some ratio 0 < r ≤ 1. Then experimental evidence suggests that for each fixed value of r ∈ (0, 1] the expected running time is linear in n, although the constant of linearity does vary with r. The Markov chain approach to this problem is only known to be rapidly mixing when r < 0.2 [8].

5 Conclusion

The RR approach to perfect sampling gives exact samples from difficult distributions without using the traditional Markov chain. It is quite different from other recent approaches to perfect sampling such as coupling from the past. Because it dispenses with the Markov chain, the RR approach yields, for restricted versions of some of these problems, the first expected linear time algorithms. Even when the running time of RR is unknown, the algorithm may be run and the output will be guaranteed to come from the correct distribution.

Unlike coupling from the past, RR is interruptible, so the user may set a time limit on the algorithm's running time (if measured in number of iterations of the basic Repeat loop) without introducing bias into the sample. Like read-once coupling from the past [20], this algorithm does not require storage of any random bits. (Another perfect sampling approach, that of Fill, Machida, Murdoch, and Rosenthal [3], is also interruptible but not read-once, and so does require storage of random bits.) We wish to stress that these existing means for perfect sampling rely on finding a "good" Markov chain for the problem at hand. RR does away with the chain, and in doing so breaks the O(n ln n) barrier that has characterized so many of these problems.

For independent sets and for proper colorings, the theoretical bounds obtained apply only for a more restricted set of parameters than do those based on Markov chain approaches. However, computer experiments show that when the appropriate restriction is met, our RR method is faster, yielding samples in a linear (expected) number of steps. Moreover, much work has gone into analyses of Markov chains, while our work is still rather new, and we might hope with time and further effort eventually to match or
(Another perfect sampling ap- for the optimal total access time for any sequence of item proach, that of Fill, Machida, Murdoch, and Rosenthal [3] requests [18]. Alternatively, it is useful to employ proba- is also interruptible but not read-once, and so does requires bilistic models to describe how list items are chosen to be storage of random bits). We wish to stress that these ex- accessed. Commonly, such an access model will induce a isting means for perfect sampling rely on finding a “good” Markov chain model on the evolution of the order of the list. Markov chain for the problem at hand. RR does away with Characteristics such as the limiting distribution as t the chain, and in doing so breaks the O(n ln n) barrier that → ∞ of At, where At is the access time for the item accessed at has characterized so many of these problems. time t, can then be estimated by drawing from the stationary For independent sets and for proper colorings, the the- distribution of the chain. oretical bounds obtained apply only for a more restricted To be specific, label the items with identification num- set of parameters than do those based on Markov chain ap- bers 1, . . . , n; suppose that at each time step, independently proaches. However, computer experiments show that when of previous time steps, any particular item i is accessed with the appropriate restriction is met, our RR method is faster, probability pi > 0 (independently of the order of the list); yielding samples in a linear (expected) number of steps. and suppose that after each selection, the accessed item is Moreover, much work has gone into analyses of Markov moved forward one rank in the list, i.e., is transposed with chains, while our work is still rather new, and we might its predecessor in the list. (If the accessed item is already at hope with time and further effort eventually to match or


even to relax the restrictions needed for the Markov chain approaches. For the Move Ahead 1 chain we do not know any theoretical bounds on the running time of our method. However, for this problem the RR method works much better in practice than does the CFTP method.

For the random cluster model, our RR technique is guaranteed to run in a linear (expected) number of steps for a range of values of p. This is in sharp contrast to the Markov chain approach, where no polynomial running time bounds are known except in trivial cases.

In summary, the randomness recycler is not applicable in all situations where Markov chain approaches are used, but RR often gives a fast, read-once, interruptible means for generating perfect samples that in restricted cases gives the first linear time algorithms for some difficult and important problems.

References

[1] Martin Dyer, Alan Frieze, and Mark Jerrum. On counting independent sets in sparse graphs. In Proc. of 40th Foundations of Computer Science, 1999.

[2] Martin Dyer and Catherine Greenhill. On Markov chains for independent sets. Journal of Algorithms, 35(1):17–49, 2000.

[3] James A. Fill, Motoya Machida, Duncan J. Murdoch, and Jeffrey S. Rosenthal. Extension of Fill's perfect rejection sampling algorithm to general chains. Random Structures & Algorithms, 2000. To appear.

[4] C.M. Fortuin and P.W. Kasteleyn. On the random cluster model I: Introduction and relation to other models. Physica, 57:536–564, 1972.

[5] V. Gore and M. Jerrum. The Swendsen-Wang process does not always mix rapidly. Journal of Statistical Physics, 97(1–2):67–86, 1999.

[6] Olle Häggström and Karin Nelander. On exact simulation from Markov random fields using coupling from the past. Scand. J. Statist., 26(3):395–411, 1999.

[7] Mark L. Huber. Exact sampling and approximate counting techniques. In Proc. of the 30th Symposium on the Theory of Computing, pages 31–40, 1998.

[8] Mark L. Huber. Perfect Sampling with Bounding Chains. PhD thesis, Cornell University, 1999.

[9] M. Jerrum. A very simple algorithm for estimating the number of k-colourings of a low-degree graph. Random Structures & Algorithms, 7(2):157–165, 1995.

[10] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22:1087–1116, 1993.

[11] Frank Kelly. Loss networks. Annals of Applied Probability, 1(3):319–378, 1991.

[12] Michael Luby and Eric Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets. Random Structures & Algorithms, 15:229–241, 1999.

[13] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.

[14] Sidney C. Port. Theoretical Probability for Applications. Wiley–Interscience, 1994.

[15] James G. Propp and David B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures & Algorithms, 9(1–2):223–252, 1996.

[16] Dana Randall and David Wilson. Sampling spin configurations of an Ising system. In Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1999.

[17] Ronald Rivest. On self-organizing sequential search heuristics. Communications of the ACM, 19:63–67, 1976.

[18] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28:202–208, 1985.

[19] Eric Vigoda. Improved bounds for sampling colorings. J. of Mathematical Physics, 41(3):1555–1569, 2000.

[20] David B. Wilson. How to couple from the past using a read-once source of randomness. Random Structures & Algorithms, 16(1):85–113, 2000.
