
Stochastic Approximation in Monte Carlo Computation

Faming LIANG, Chuanhai LIU, and Raymond J. CARROLL

The Wang–Landau (WL) algorithm is an adaptive Monte Carlo algorithm used to calculate the spectral density for a physical system. A remarkable feature of the WL algorithm is that it is not trapped by local energy minima, which is very important for systems with rugged energy landscapes. This feature has led to many successful applications of the algorithm in statistical physics and biophysics; however, there does not exist rigorous theory to support its convergence, and the estimates produced by the algorithm can reach only a limited statistical accuracy. In this article we propose the stochastic approximation Monte Carlo (SAMC) algorithm, which overcomes the shortcomings of the WL algorithm. We establish a theorem concerning its convergence. The estimates produced by SAMC can be improved continuously as the simulation proceeds. SAMC also extends applications of the WL algorithm to continuum systems. The potential uses of SAMC in statistics are discussed through two classes of applications, importance sampling and model selection. The results show that SAMC can work as a general importance sampling algorithm and a model selection sampler when the model space is complex.

KEY WORDS: Importance sampling; Markov chain Monte Carlo; Model selection; Spatial autologistic model; Stochastic approximation; Wang–Landau algorithm.

1. INTRODUCTION

Suppose that we are interested in sampling from a distribution that, for convenience, we write in the following form:

p(x) = c p0(x), x ∈ X, (1)

where c is a constant and X is the sample space. As known by many researchers, the Metropolis–Hastings (MH) sampler (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953; Hastings 1970) is prone to becoming trapped in local energy minima when the energy landscape of the distribution is rugged. [In terms of physics, −log{p0(x)} is called the energy function of the distribution.] Over the last two decades, various advanced Monte Carlo algorithms have been proposed to overcome this problem, based mainly on the following two ideas.

The first idea is the use of auxiliary variables. Included in this category are the Swendsen–Wang algorithm (Swendsen and Wang 1987), simulated tempering (Marinari and Parisi 1992; Geyer and Thompson 1995), parallel tempering (Geyer 1991; Hukushima and Nemoto 1996), evolutionary Monte Carlo (Liang and Wong 2001), and others. In these algorithms, the temperature is typically treated as an auxiliary variable. Simulations at high temperatures broaden sampling of the sample space and thus are able to help the system escape from local energy minima.

The second idea is the use of past samples. The multicanonical algorithm (Berg and Neuhaus 1991) is apparently the first work in this direction. This algorithm is essentially a dynamic importance sampling algorithm in which the trial distribution is learned dynamically from past samples. Related works include the 1/k-ensemble algorithm (Hesselbo and Stinchcombe 1995), the Wang–Landau (WL) algorithm (Wang and Landau 2001), and the generalized Wang–Landau (GWL) algorithm (Liang 2004, 2005). These differ from the multicanonical algorithm only in the specification and/or the learning scheme for the trial distribution. Other work included in this category is dynamic weighting (Wong and Liang 1997; Liu, Liang, and Wong 2001; Liang 2002), where the acceptance rate of the MH moves is adjusted dynamically with an importance weight that carries the information of past samples.

Among the algorithms described here, the WL algorithm has received much recent attention in physics. It can be described as follows. Suppose that the sample space X is finite. Let U(x) = −log{p0(x)} denote the energy function, let {u1, ..., um} be a set of real numbers containing all possible values of U(x), and let g(u) = #{x : U(x) = u} be the number of states with energy equal to u. In physics, g(u) is called the spectral density or the density of states of the distribution. For simplicity, we also denote g(ui) by gi in what follows. The WL algorithm is an adaptive Markov chain Monte Carlo (MCMC) algorithm designed to estimate g = (g1, ..., gm). Let ĝi be the working estimate of gi. A run of the WL algorithm consists of several stages. The first stage starts with the initial estimates ĝ1 = ··· = ĝm = 1 and a sample drawn from X at random, and iterates between the following steps:

The WL algorithm.
1. Simulate a sample x by a single Metropolis update with the invariant distribution p̂(x) ∝ 1/ĝ(U(x)).
2. Set ĝi ← ĝi δ^I(U(x)=ui) for i = 1, ..., m, where δ is a gain factor >1 and I(·) is an indicator function.

The algorithm iterates until a flat histogram has been produced in the space of energy. Once the histogram is flat, the algorithm restarts by passing on ĝ(u) as the initial value of the new stage and reducing δ to a smaller value according to a prespecified scheme, say, δ ← √δ. The process is repeated until δ is very close to 1, say, log(δ) ≤ 10^−8. Wang and Landau (2001) considered a histogram flat if the sampling frequency for each energy value is not <80% of the average sampling frequency.

Faming Liang is Associate Professor, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: fl[email protected]). Chuanhai Liu is Professor, Department of Statistics, Purdue University, West Lafayette, IN 47907 (E-mail: [email protected]). Raymond J. Carroll is Distinguished Professor, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: [email protected]). Liang's research was supported by grants from the National Science Foundation (DMS-04-05748) and the National Cancer Institute (CA104620). Carroll's research was supported by a grant from the National Cancer Institute (CA57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30ES09106). The authors thank Walter W. Piegorsch, the associate editor, and three referees for their suggestions and comments, which led to significant improvement of this article.

© 2007 American Statistical Association, Journal of the American Statistical Association, March 2007, Vol. 102, No. 477, Theory and Methods. DOI 10.1198/016214506000001202

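The WL iteration above can be sketched in a few lines of code. The following is our illustrative Python sketch, not the authors' implementation: it assumes a generic finite system given as a list of state energies, uses a uniform proposal over states for simplicity, and works with log ĝi to avoid numerical overflow.

```python
import math
import random

def wl_density_of_states(energies, delta0=math.e, flat_ratio=0.8,
                         log_delta_min=1e-8, seed=0):
    """Wang-Landau sketch for a finite system given as a list of state
    energies.  Returns log g-hat per distinct energy level, up to an
    additive constant."""
    rng = random.Random(seed)
    n = len(energies)
    levels = sorted(set(energies))
    idx = {u: i for i, u in enumerate(levels)}   # energy value -> level index
    log_g = [0.0] * len(levels)                  # log g-hat, initially log 1
    log_delta = math.log(delta0)
    x = rng.randrange(n)                         # random initial state
    while log_delta > log_delta_min:             # one pass per stage
        hist, total = [0] * len(levels), 0
        while True:
            y = rng.randrange(n)                 # uniform (symmetric) proposal
            # Metropolis acceptance for p(x) proportional to 1/g(U(x))
            a = log_g[idx[energies[x]]] - log_g[idx[energies[y]]]
            if a >= 0 or rng.random() < math.exp(a):
                x = y
            i = idx[energies[x]]
            log_g[i] += log_delta                # g_i <- g_i * delta
            hist[i] += 1
            total += 1
            # flat histogram: every frequency within 80% of the mean
            if total % 1000 == 0 and min(hist) > flat_ratio * total / len(levels):
                break
        log_delta /= 2.0                         # delta <- sqrt(delta)
    return log_g
```

For a toy system with energies (0, 1, 1, 2, 2, 2, 2), the true density of states is g = (1, 2, 4), so the returned log estimates should differ by roughly log 2 and log 4 from the first entry; as the article discusses, the WL estimates saturate at a limited accuracy rather than improving indefinitely.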

Liang (2005) generalized the WL algorithm to continuum systems. This generalization is mainly in terms of three respects: the sample space, the working function, and the estimate-updating scheme. Suppose that the sample space X is continuous and has been partitioned according to a chosen parameterization, say, the energy function U(x), into m disjoint subregions denoted by E1 = {x : U(x) ≤ u1}, E2 = {x : u1 < U(x) ≤ u2}, ..., Em−1 = {x : um−2 < U(x) ≤ um−1}, and Em = {x : U(x) > um−1}, where u1, ..., um−1 are m − 1 specified real numbers. Let ψ(x) be a nonnegative function defined on the sample space with 0 < ∫_X ψ(x) dx < ∞. In practice, ψ(x) is often set to p0(x) defined in (1). Let gi = ∫_{Ei} ψ(x) dx. One iteration of GWL consists of the following steps:

The GWL algorithm.
1. Simulate a sample x by a number, κ, of MH steps of which the invariant distribution is defined as

p̂(x) ∝ Σ_{i=1}^m [ψ(x)/ĝi] I(x ∈ Ei). (2)

2. Set ĝ_{J(x)+k} ← ĝ_{J(x)+k} + δ ϱ^k ĝ_{J(x)+k} for k = 0, ..., m − J(x), where J(x) is the index of the subregion to which x belongs and ϱ > 0 is a parameter that controls the sampling frequency for each of the subregions.

The extension of gi from the density of states to the integral ∫_{Ei} ψ(x) dx is of great interest to statisticians, because this leads to direct applications of the algorithm to model selection, highest posterior density (HPD) region construction, and many other Bayesian computational problems. Liang (2005) also studied the convergence of the GWL algorithm; as κ becomes large, ĝi is consistent for gi. However, when κ is small, say κ = 1 (the choice adopted by the WL algorithm), there is no rigorous theory to support the convergence of ĝi. In fact, some deficiencies of the WL algorithm have been observed in simulations. Yan and de Pablo (2003) noted that estimates of gi can reach only a limited statistical accuracy that will not be improved with further iterations, and that the large number of configurations generated toward the end of the simulation make only a small contribution to the estimates.

We find that this deficiency of the WL algorithm is caused by the choice of the gain factor δ. This can be explained as follows. Let ns be the number of iterations performed in stage s, and let δs be the gain factor used in stage s. Let n1 = ··· = ns = ··· = n, where n is large enough that a flat histogram can be reached in each stage. Let log δs = (1/2) log δs−1 decrease geometrically, as suggested by Wang and Landau (2001). Then the tail sum Σ_{s=S+1}^∞ n log δs < ∞ for any value of S. Note that the tail sum represents the total correction to the current estimate in the iterations that follow. Hence the numerous configurations generated toward the end of the simulation make only a small contribution to the estimates. To overcome this deficiency, Liang (2005) suggested that ns should increase geometrically with the rate log δ_{s+1}/log δs. However, this leads to an explosion in the total number of iterations required by the simulation.

In this article we propose the stochastic approximation Monte Carlo (SAMC) algorithm, which can be considered a stochastic approximation correction of the WL and GWL algorithms. In SAMC, the choice of the gain factor is guided by a condition given in the stochastic approximation algorithm (Andrieu, Moulines, and Priouret 2005), which ensures that the estimates of g can be improved continuously as the simulation proceeds. It is shown that under mild conditions, SAMC will converge. In addition, SAMC can bias sampling to some subregions of interest, say the low-energy region, according to a distribution defined on the subspace of the partition. This is different from WL, where each energy must be sampled equally. It is also different from GWL, where the sampling frequencies of the subregions follow a certain pattern determined by the parameter ϱ. Hesselbo and Stinchcombe (1995) and Liang (2005) showed numerically that biasing sampling to low-energy regions often results in a simulation with improved ergodicity. This makes SAMC attractive for difficult optimization problems. In addition, SAMC is user-friendly; it avoids the requirement of histogram checking during simulations. We discuss the potential use of SAMC in statistics through two classes of examples: importance sampling and model selection. It turns out that SAMC can work as a general importance sampling method and a model selection sampler when the model space is complex.

The article is organized as follows. In Section 2 we describe the SAMC algorithm and study its convergence theory. In Section 3 we compare WL and SAMC through a numerical example. In Section 4 we explore the use of SAMC in importance sampling, and in Section 5 we discuss the use of SAMC in model selection. In Section 6 we conclude the article with a brief discussion.

2. STOCHASTIC APPROXIMATION MONTE CARLO

Consider the distribution defined in (1). For mathematical convenience, we assume that X is either finite (for a discrete system) or compact (for a continuum system). For a continuum system, X can be restricted to the region {x : p0(x) ≥ pmin}, where pmin is sufficiently small such that the region {x : p0(x) < pmin} is not of interest. As in GWL, we let E1, ..., Em denote m disjoint regions that form a partition of X. In practice, sup_{x∈X} p0(x) is often unknown. An inappropriate specification of the ui's may result in some subregions being empty. A subregion Ei is empty if gi = ∫_{Ei} ψ(x) dx = 0. SAMC allows the existence of empty subregions in simulations. Let ĝi^(t) denote the estimate of gi obtained at iteration t. For convenience, we let θti = log(ĝi^(t)) and θt = (θt1, ..., θtm). The distribution (2) can then be rewritten as

p_{θt}(x) = (1/Zt) Σ_{i=1}^m [ψ(x)/e^{θti}] I(x ∈ Ei). (3)

For theoretical simplicity, we assume that θt ∈ Θ for all t, where Θ is a compact set. In this article we set Θ = [−10^100, 10^100]^m for all examples, although as a practical matter this is essentially equivalent to setting Θ = R^m. Because p_{θt}(x) is invariant with respect to a location transformation of θt (adding to or subtracting a constant vector from θt will not change p_{θt}(x)), θt can be kept in the compact set in simulations by adjusting it with a constant vector. Because X and Θ are both assumed to be compact, an additional assumption is that p_{θt}(x) is bounded away from 0 and ∞ on X. Let the proposal distribution q(x, y) satisfy the following condition: For every x ∈ X, there exist ε1 > 0 and ε2 > 0 such that

|x − y| ≤ ε1 ⇒ q(x, y) ≥ ε2. (4)

This is a natural condition in a study of MCMC theory (Roberts and Tweedie 1996). In practice, this kind of proposal can be easily designed for both continuum and discrete systems. For a continuum system, q(x, y) can be set to the random-walk Gaussian proposal y ∼ N(x, σ²I), with σ² calibrated to give a desired acceptance rate. For a discrete system, q(x, y) can be set to a discrete distribution defined on a neighborhood of x, assuming that the states have been ordered in a certain way.

Let π = (π1, ..., πm) be an m-vector with 0 < πi < 1 and Σ_{i=1}^m πi = 1, which defines the desired sampling frequency for each of the subregions. Henceforth, π is called the desired sampling distribution. Let {γt} be a positive, nonincreasing sequence satisfying

(a) Σ_{t=1}^∞ γt = ∞ and (b) Σ_{t=1}^∞ γt^ζ < ∞ (5)

for some ζ ∈ (1, 2). For example, in this article we set

γt = t0/max(t0, t), t = 1, 2, ..., (6)

for some specified value of t0 > 1. With the foregoing notation, one iteration of SAMC can be described as follows:

The SAMC algorithm.
1. Simulate a sample x^(t+1) by a single MH update, of which the proposal distribution is q(x^(t), ·) and the invariant distribution is p_{θt}(x).
2. Set θ* = θt + γ_{t+1}(e_{t+1} − π), where e_{t+1} = (e_{t+1,1}, ..., e_{t+1,m}) and e_{t+1,i} = 1 if x^(t+1) ∈ Ei and 0 otherwise. If θ* ∈ Θ, then set θ_{t+1} = θ*; otherwise, set θ_{t+1} = θ* + c*, where c* = (c*, ..., c*) can be an arbitrary vector that satisfies the condition θ* + c* ∈ Θ.

Remark. The explanation for condition (5) can be found in advanced books on stochastic approximation (e.g., Nevel'son and Has'minskii 1973). The first condition is necessary for the convergence of θt. If Σ_{t=1}^∞ γt < ∞, then, as follows from step 2 (assuming that adjustment of θt does not occur), Σ_{t=1}^∞ |θ_{t+1,i} − θ_{t,i}| ≤ Σ_{t=1}^∞ γt |e_{t,i} − πi| ≤ Σ_{t=1}^∞ γt < ∞, where the second inequality follows from the fact that 0 ≤ e_{t,i}, πi ≤ 1. Thus the value of θ_{t,i} cannot reach log(gi) if, for example, the initial point θ_{0,i} is sufficiently far away from log(gi). On the other hand, γt cannot be too large; an overly large γt will prevent convergence. It turns out that the second condition in (5) asymptotically damps the effect of the random errors introduced by e_t. When it holds, we have γt |e_{t,i} − πi| ≤ γt → 0 as t → ∞.

SAMC falls into the category of stochastic approximation algorithms (Benveniste, Métivier, and Priouret 1990; Andrieu et al. 2005). Theoretical results on the convergence of SAMC are given in the Appendix. The theory states that under mild conditions, we have

θ_{t,i} → C + log(∫_{Ei} ψ(x) dx) − log(πi + d) if Ei ≠ ∅, and θ_{t,i} → −∞ if Ei = ∅, (7)

as t → ∞, where d = Σ_{j ∈ {i : Ei = ∅}} πj/(m − m0), m0 is the number of empty subregions, and C is an arbitrary constant. Because p_{θt}(x) is invariant with respect to a location transformation of θt, C cannot be determined by the samples drawn from p_{θt}(x). To determine the value of C, extra information is needed; for example, that Σ_{i=1}^m e^{θ_{t,i}} is equal to a known number. Let π̂_{t,i} denote the realized sampling frequency of the subregion Ei at iteration t. As t → ∞, π̂_{t,i} converges to πi + d if Ei ≠ ∅ and to 0 otherwise. Note that for a nonempty subregion, the sampling frequency is independent of its probability, ∫_{Ei} p(x) dx. This implies that SAMC is capable of exploring the whole sample space, even the regions with tiny probabilities. Potentially, SAMC can be used to sample rare events from a large sample space. In practice, SAMC tends to lead to a "random walk" in the space of nonempty subregions (if each subregion is considered a "point"), with the sampling frequency of each nonempty subregion being proportional to πi + d.

The subject area of stochastic approximation was founded by Robbins and Monro (1951). After five decades of continual development, it has developed into an important area in systems control and optimization and has also served as a prototype for the development of recursive algorithms for on-line estimation and control of stochastic systems. (See Lai 2003 for an overview.) Recently, it has been used with MCMC to solve maximum likelihood estimation problems (Younes 1988, 1999; Moyeed and Baddeley 1991; Gu and Kong 1998; Gelfand and Banerjee 1998; Delyon, Lavielle, and Moulines 1999; Gu and Zhu 2001). The critical difference between SAMC and other stochastic approximation MCMC algorithms lies in sample space partitioning. With our use of partitioning, many new applications can be established in Monte Carlo computation, for example, importance sampling and model selection, as described in Sections 4 and 5. In the same spirit, SAMC can also be applied to HPD interval construction, normalizing constant estimation, and other problems, as discussed by Liang (2005). In addition, sample space partitioning improves its performance in optimization. Control of the sampling frequency effectively prevents the system from getting trapped in local energy minima in simulations. We will explore this issue further elsewhere.

It is noteworthy that Geyer and Thompson (1995) and Geyer (1996) mentioned that stochastic approximation can be used to determine the "pseudopriors" for simulated tempering (i.e., determining the normalizing constants of a sequence of distributions scaled by temperature), although they provided no details. In Geyer's applications, the sample space is partitioned automatically according to the temperature variable.

For effective implementation of SAMC, several issues must be considered:

• Sample space partition. This can be done according to our goal and the complexity of the given problem. For example, if we aim to construct a trial density function for importance sampling (as illustrated in Sec. 4) or to minimize the energy function, then the sample space can be partitioned according to the energy function. The maximum energy difference in each subregion should be bounded by a reasonable number, say 2, to ensure that the local MH moves within the same subregion have a reasonable acceptance rate. Note that within the same subregion, sampling from the working density (3) reduces to sampling from ψ(x). If our goal is model selection, then the sample space can be partitioned according to the index of models, as illustrated in Section 5.

• The desired sampling distribution. If we aim to estimate g, then we may set the desired distribution to be uniform, as is done in all examples in this article. However, if our goal is optimization, then we may set the desired distribution biased to low-energy regions. As shown by Hesselbo and Stinchcombe (1995) and Liang (2005), biasing sampling to low-energy regions often improves the ergodicity of the simulation. Our numerical results on BLN protein models (Honeycutt and Thirumalai 1990) also strongly support this point. Due to space limitations, we will report these results elsewhere.

• Choice of t0 and the number of iterations. To estimate g, γt should be very close to 0 at the end of the simulation. Otherwise, the resulting estimates will have a large variation. The speed of γt going to 0 can be controlled by t0. In practice, t0 can be chosen according to the complexity of the problem. The more complex the problem, the larger the value of t0 that should be chosen. A large t0 will force the sampler to reach all subregions quickly, even in the presence of multiple local energy minima. In our experience, t0 is often set to between 2m and 100m, with m being the number of subregions.

The appropriateness of the choice of t0 and the number of iterations can be determined by checking the convergence of multiple runs (starting with different points) through examining the variation of ĝ or π̂, where ĝ and π̂ denote the estimates of g and π obtained at the end of a run. A rough examination for ĝ is to check visually whether or not the ĝ vectors produced in the multiple runs follow the same pattern. Existence of different patterns implies that the gain factor is still large at the end of the runs, or that some parts of the sample space were not visited in all runs. The examination for ĝ can also be done by a statistical test under the assumption of multivariate normality. (See Jobson 1992, pp. 150–153, for the testing methods for multivariate outliers.)

To examine the variation of π̂, we define the statistic εf(Ei), which measures the deviation of π̂i, the realized sampling frequency of subregion Ei in a run, from its theoretical value. The statistic is defined as

εf(Ei) = {π̂i − (πi + d̂)}/(πi + d̂) × 100% if Ei is visited, and εf(Ei) = 0 otherwise, (8)

for i = 1, ..., m, where d̂ = Σ_{j ∈ {i : Ei is not visited}} πj/(m − m̂0) and m̂0 is the number of subregions that are not visited. Note that d̂ can be considered an estimate of d in (7). It is said that {εf(Ei)}, output from all runs and for all subregions, matches well if the following two conditions are satisfied: (a) there does not exist a subregion that is visited in some runs but not in others, and (b) max_{i=1,...,m} |εf(Ei)| is less than a threshold value, say 10%, for all runs. A group of {εf(Ei)} that does not match well implies that some parts of the sample space are not visited in all runs, that t0 is too small (the self-adjusting ability is thus weak), or that the number of iterations is too small. We note that the idea of monitoring convergence of MCMC simulations using multiple runs was discussed by Gelman and Rubin (1992) and Geyer (1992).

In practice, to have a reliable diagnostic for the convergence, we may check both ĝ and π̂. In the case where a failure of multiple-run convergence is detected, SAMC should be rerun with more iterations or a larger value of t0. Determining t0 and the number of iterations is a trial-and-error process.

3. TWO DEMONSTRATION EXAMPLES

Example 1. In this example we compare the convergence and efficiency of WL and SAMC. The distribution of the example consists of 10 states with the unnormalized mass function P(x) as given in Table 1. It has two modes that are well separated by low-mass states.

Table 1. The Unnormalized Mass Function of the 10-State Distribution

x      1    2   3   4   5   6   7    8   9   10
P(x)   1  100   2   1   3   3   1  200   2    1

The sample space was partitioned according to the mass function into the following five subregions: E1 = {8}, E2 = {2}, E3 = {5, 6}, E4 = {3, 9}, and E5 = {1, 4, 7, 10}. In simulations, we set ψ(x) = 1. The true value of g is then g = (1, 1, 2, 2, 4), the number of states in the respective subregions. The proposal used in the MH step is a stochastic matrix of which each row is generated independently from the Dirichlet distribution Dir(1, ..., 1). The desired sampling distribution is uniform, that is, π1 = ··· = π5 = 1/5. The sequence {γt} is as given in (6) with t0 = 10. SAMC was run 100 times independently. Each run consists of 5 × 10^5 iterations. The estimation error of g was measured by the function εe(t) = Σ_{i: Ei ≠ ∅} (ĝi^(t) − gi)²/gi at 10 equally spaced time points, t = 5 × 10^4, ..., 5 × 10^5. Figure 1(a) shows the curve of εe(t) obtained by averaging over the 100 runs. The statistic εf(Ei) was calculated at time t = 10^5 for each run. The results show that they match well. Figure 1(b) shows boxplots of the εf(Ei)'s of the 100 runs. The deviations are <3%. This indicates that SAMC has achieved the desired sampling distribution and that the choices of t0 and the number of iterations are appropriate. Other choices of t0, including t0 = 20 and 30, were also tried, and the results were similar.

We applied the WL algorithm to this example with the same proposal distribution as that used in SAMC. In the runs, the gain factor was set as done by Wang and Landau (2001); it started with δ0 = 2.718 and then decreased according to the scheme δ_{s+1} ← √δs. Let ns denote the number of iterations performed in stage s. For simplicity, we set ns to a constant that was large enough that a flat histogram could be formed in each stage. The choices of ns that we tried included ns = 1,000, 2,500, 5,000, and 10,000. The estimation error was also measured by εe(t) evaluated at t = 5 × 10^4, ..., 5 × 10^5, where t is the total number of iterations made so far in the run. Figure 1(a) shows the curves of εe(t) for each choice of ns, where each curve was obtained by averaging over 100 independent runs.

The comparison shows that for this example, SAMC produces more accurate estimates for g and converges much faster than WL. More importantly, in SAMC the estimates can be improved continuously as the simulation proceeds, whereas in WL the estimates can reach only a certain accuracy depending on the value of ns.
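For concreteness, the SAMC setup of Example 1 (ψ(x) = 1, uniform π, γt from (6) with t0 = 10) can be sketched as follows. This is our illustration, not the authors' code; in particular, it substitutes a uniform proposal over the 10 states for the Dirichlet-generated stochastic matrix used in the paper.

```python
import math
import random

def samc_10state(n_iter=200_000, t0=10, seed=1):
    """SAMC on the 10-state toy of Example 1 with psi(x) = 1, so that
    g-hat estimates the number of states per subregion, g = (1,1,2,2,4)."""
    rng = random.Random(seed)
    # subregion index J(x) for states x = 1,...,10
    J = {8: 0, 2: 1, 5: 2, 6: 2, 3: 3, 9: 3, 1: 4, 4: 4, 7: 4, 10: 4}
    m = 5
    pi = [1.0 / m] * m                  # desired sampling distribution
    theta = [0.0] * m                   # theta_t, log-estimates of g
    x = rng.randrange(1, 11)
    for t in range(1, n_iter + 1):
        y = rng.randrange(1, 11)        # symmetric uniform proposal
        # MH step for p_theta(x) proportional to 1/exp(theta_J(x))
        a = theta[J[x]] - theta[J[y]]
        if a >= 0 or rng.random() < math.exp(a):
            x = y
        gamma = t0 / max(t0, t)         # gain factor sequence (6)
        for i in range(m):              # theta <- theta + gamma*(e - pi)
            theta[i] += gamma * ((1.0 if i == J[x] else 0.0) - pi[i])
    # normalise: g_i proportional to exp(theta_i), scaled to 10 states
    g = [math.exp(th) for th in theta]
    s = sum(g)
    return [10.0 * gi / s for gi in g]
```

With a few hundred thousand iterations, the normalized estimates come out close to the true g = (1, 1, 2, 2, 4); because the uniform π makes all subregions equally visited, the scale constant C in (7) is removed here simply by rescaling the ĝi to sum to the known number of states.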


Figure 1. Comparison of the WL and SAMC Algorithms. (a) Average εe(t) curves obtained by SAMC and WL (SAMC; WL, n = 1,000; WL, n = 2,500; WL, n = 5,000; WL, n = 10,000); the vertical bars show ±1 standard deviation of the average of the estimates. (b) Boxplots of {εf (Ei )} obtained in 100 runs of SAMC.
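The diagnostic statistic (8) is straightforward to compute from a run's realized sampling frequencies. The helper below is our sketch, assuming the realized frequencies, the desired distribution, and per-subregion visit indicators have been recorded:

```python
def epsilon_f(realized, pi, visited):
    """Deviation statistic (8): percentage deviation of the realized
    sampling frequency of each visited subregion from its theoretical
    value pi_i + d-hat, where d-hat redistributes the desired mass of
    the unvisited subregions over the m - m0-hat visited ones."""
    m = len(pi)
    m0 = sum(1 for v in visited if not v)            # unvisited count
    d_hat = sum(p for p, v in zip(pi, visited) if not v) / (m - m0)
    return [(f - (p + d_hat)) / (p + d_hat) * 100.0 if v else 0.0
            for f, p, v in zip(realized, pi, visited)]
```

For example, with uniform π over five subregions and all subregions visited, a realized frequency of 0.22 in one subregion gives εf = +10%, which would exceed a 10% matching threshold only marginally.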

Example 2. As pointed out by Liu (2001), umbrella sampling (Torrie and Valleau 1977) can be seen as a precursor of many advanced Monte Carlo algorithms, including simulated tempering, multicanonical, and thus WL, GWL, and SAMC. Although umbrella sampling was proposed originally for estimating the ratio of two normalizing constants, it also can be used as a general importance sampling method. Recall that the basic idea of umbrella sampling is to sample from an "umbrella distribution" (i.e., a trial distribution in terms of importance sampling), which covers the important regions of both target distributions. Torrie and Valleau (1977) proposed two possible schemes for constructing umbrella distributions. One is to sample intermediate systems of the temperature-scaling form p_st^(i)(x) ∝ [p0(x)]^{1/Ti} for Tm > Tm−1 > ··· > T1 = 1. This leads directly to the simulated tempering algorithm. The other is to sample a weighted distribution pu(x) ∝ ω{U(x)} p0(x), where the weight function ω(·) is a function of the energy variable and can be determined by a pilot study. Thus umbrella sampling can be seen as a precursor of multicanonical, WL, GWL, and SAMC. Sample space partitioning, motivated by discretization of continuum systems, provides a new methodology for applying umbrella sampling to continuum systems.

Although SAMC and simulated tempering both fall in the class of umbrella sampling algorithms, they have quite different dynamics. This can be illustrated by an example. The distribution is defined as p(x) ∝ e^{−U(x)}, where x ∈ [−1.1, 1.1]² and U(x) = −{x1 sin(20 x2) + x2 sin(20 x1)}² cosh{sin(10 x1) x1} − {x1 cos(10 x2) − x2 sin(10 x1)}² cosh{cos(20 x2) x2}. This example is modified from example 5.3 of Robert and Casella (2004). Figure 2(a) shows that U(x) has a multitude of local energy minima separated by high-energy barriers. When applying SAMC to this example, we partitioned the sample space into 41 subregions with an equal energy bandwidth, E1 = {x : U(x) ≤ −8.0}, E2 = {x : −8.0 < U(x) ≤ −7.8}, ..., and E41 = {x : −.2 < U(x) ≤ 0}, and set the other parameters as follows: ψ(x) = e^{−U(x)}, t0 = 200, π1 = ··· = π41 = 1/41, and a random-walk proposal, q(xt, ·) = N2(xt, .25² I2). SAMC was run for 20,000 iterations, and 2,000 samples were collected at equally spaced time points. Figure 2(b) shows the evolving path of the 2,000 samples. For comparison, MH was applied to simulate from the distribution pst(x) ∝ e^{−U(x)/5}. MH was run for 20,000 iterations with the same proposal N2(xt, .25² I2), and 2,000 samples were collected at equally spaced time points. Figure 2(c) shows the evolving path of the 2,000 samples, which characterizes the performance of simulated tempering at high temperatures.

The result is clear. In the foregoing setting, SAMC samples almost uniformly in the space of energy [i.e., the energy bandwidth of each subregion is small, and the sample distribution closely matches the contour plot of U(x)], whereas simulated tempering tends to sample uniformly in the sample space X when the temperature is high.


Figure 2. (a) Contour of U(x), (b) Sample Path of SAMC, and (c) Sample Path of MH.

Because we usually do not know a priori where the high-energy and low-energy regions are and what the ratio of their "volumes" is, we cannot control the simulation time spent on low-energy and high-energy regions in simulated tempering. However, in SAMC we can control, almost exactly and up to the constant d in (7), the simulation time spent on low-energy and high-energy regions by choosing the desired sampling distribution π. SAMC can go to high-energy regions, but it spends only limited time there to help the system escape from local energy minima, and it also spends time exploring low-energy regions. This smart distribution of simulation time makes SAMC potentially more efficient than simulated tempering in optimization. Due to space limitations, we do not explore this point in this article, but we note that Liang (2005) reported a neural network training example in which GWL was shown to be more efficient than simulated tempering in locating global energy minima.

4. USE OF STOCHASTIC APPROXIMATION MONTE CARLO IN IMPORTANCE SAMPLING

Suppose that we are interested in estimating the quantity Ep h(x) for some function h(x). Because of its rugged energy landscape, the target distribution p(x) is very difficult to simulate from with conventional Monte Carlo algorithms. Let x1, ..., xn denote the samples drawn from a trial density p*(x), and let w1, ..., wn denote the associated importance weights, where wi = p(xi)/p*(xi) for i = 1, ..., n. The quantity Ep h(x) then can be estimated by

Êp h(x) = Σ_{i=1}^n h(xi) wi / Σ_{i=1}^n wi. (9)

Although this estimate converges almost surely to Ep h(x), its variance is finite only if

Ep* [h²(x) {p(x)/p*(x)}²] = ∫_X h²(x) p²(x)/p*(x) dx < ∞.

If the ratio p(x)/p*(x) is unbounded, then the weights p(xi)/p*(xi) will vary widely, and thus the resulting estimate will be unreliable. A good trial density should necessarily satisfy the following two conditions:

a. The importance weight is bounded; that is, there exists a number M such that p(x)/p*(x) ≤ M for all x ∈ X.
b. The trial density p*(x) can be easily sampled from using a conventional Monte Carlo algorithm.
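The estimator (9) can be written generically. The sketch below is ours; the N(0, 1) target with an N(0, 2²) trial is a made-up example chosen so that the weight p(x)/p*(x) is bounded (condition a), and unnormalized log densities suffice because the normalizing constants cancel in the self-normalized ratio.

```python
import math
import random

def importance_estimate(h, log_p, log_pstar, sampler, n, seed=2):
    """Self-normalized importance sampling estimator (9):
    sum h(x_i) w_i / sum w_i with w_i = p(x_i)/p*(x_i).
    log_p and log_pstar may be unnormalised."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = sampler(rng)                        # x_i drawn from p*(x)
        w = math.exp(log_p(x) - log_pstar(x))   # importance weight w_i
        num += h(x) * w
        den += w
    return num / den

# Example: estimate E[x^2] = 1 under N(0,1), using an N(0, 2^2) trial.
# Here log w = -3x^2/8 <= 0, so the weight is bounded (condition a).
est = importance_estimate(h=lambda x: x * x,
                          log_p=lambda x: -0.5 * x * x,
                          log_pstar=lambda x: -x * x / 8.0,
                          sampler=lambda rng: rng.gauss(0.0, 2.0),
                          n=200_000)
```

An overdispersed trial such as this keeps the weights bounded; reversing the roles (a narrow trial under a wide target) would violate condition a and make the estimate unreliable, exactly as the text warns.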

In addition, the trial density should be chosen to have a similar fromp∞(x) at iteration t. The procedure consists of shape to the true density. This will minimize the variance of the following three steps: the resulting importance weights. Of course, how to specify an The SAMC importance-resampling algorithm. appropriate trial density for a general distribution has been a 1. Draw a sample xt ∼ p∞(x) using a conventional Monte long-standing and difficult problem in statistics. Carlo algorithm, say, the MH algorithm. The defensive mixture method (Hesterberg 1995) suggests 2. Draw a random number U ∼ uniform(0, 1).IfU <ωk, the following trial density: then save xt as a sample of p(x), where k is the index of ∗ p (x) = λp(x) + (1 − λ)p(x), (10) the subregion xt belongs to. 3. Set t ← t + 1 and go to step 1, until sufficient samples where 0 <λ<1 and p(x) is another density. However, in prac- have been collected. tice, p∗(x) is rather poor. Although the resulting importance Consider the following distribution: weights are bounded above by 1/λ, it cannot be easily sam- pled from using conventional Monte Carlo algorithms. Because 1 −8 1 .9 ∗ p(x) = N , p (x) contains p(x) as a component, if we can sample from −8 .91 ∗ 3 p (x), then we can also sample from p(x). In this case we do not need to use importance sampling! Stavropoulos and Titter- 1 6 1 −.9 + N , ington (2001), Warnes (2001), and Cappé, Giullin, Marin, and 3 6 −0.91 Robert (2004) suggested constructing the trial density based on previous Monte Carlo samples, but the trial densities result- 1 0 10 + N , , ing from their methods cannot guarantee that the importance 3 0 01 weights are bounded. We note that these methods are similar to SAMC in the spirit of learning from historical samples. 
which is identical to that given by Gilks, Roberts, and Sahu (1998), except that the mean vectors are separated by a larger distance in each dimension. Figure 3(a) shows its contour plot, which contains three well-separated components. The MH algorithm was used to simulate from p(x) with a random-walk proposal N(x, I2), but it failed to mix the three components. However, an advanced MCMC sampler, such as simulated tempering, parallel tempering, or evolutionary Monte Carlo, should work well for this example. Our purpose in studying this example was just to illustrate how SAMC can be used in importance sampling as a universal trial distribution constructor and how SAMC can be used as an advanced sampler to simulate from a multimodal distribution.

We applied SAMC to this example with the same proposal as used in the MH algorithm. Let X = [−10^100, 10^100]^2 be compact. It was partitioned with an equal energy bandwidth Δu = 2 into the following subregions: E1 = {x : −log p(x) < 0}, E2 = {x : 0 ≤ −log p(x) < 2}, ..., and E12 = {x : −log p(x) > 20}. Set ψ(x) = p(x), t0 = 50, and the desired sampling distribution to be uniform. In a run of 500,000 iterations, SAMC produced a trial density with the contour plot shown in Figure 3(b). On the plot there are many contour circles, formed due to the density adjustment by the gi's; the adjustment causes many points of the sample space to have the same density value. The SAMC importance-sampling algorithm was then applied to simulate samples from p(x). Figure 3(c) shows the sample path of the first 500 samples generated by the algorithm. All three components have been well mixed. Later, the run was lengthened, and the mean and variance of the distribution were estimated accurately using the simulated samples. The results indicate that SAMC can indeed work as a general trial distribution constructor for importance sampling and as an advanced sampler for simulation from a multimodal distribution.

Other trial densities based on simple mixtures of normal or t distributions also may result in unbounded importance weights, although they can be sampled from easily. Suppose that the sample space has been partitioned according to the energy function, and that the maximum energy difference in each subregion has been bounded by a reasonable number such that the local MH move within the same subregion has a reasonable acceptance rate. It is then easy to see that the distribution defined in (2) or (3) satisfies the foregoing two conditions and can work as a universal trial density even in the presence of multiple local minima on the energy landscape of the true density. Let

    p∞(x) ∝ Σ_{i=1}^m [ψ(x)/gi] I(x ∈ Ei)   (11)

denote the trial density constructed by SAMC with ψ(x) = p0(x), where gi = lim_{t→∞} e^{θti}. Assuming that the gi have been normalized by an additional constraint (e.g., Σ_{i=1}^m gi is a known constant), the importance weights are then bounded above by max_{i=1,...,m} gi ∫_X ψ(x) dx < ∞. As shown in Section 2, sampling from p∞(x) will lead to a "random walk" in the space of nonempty subregions. Hence the whole sample space can be well explored.

Besides satisfying conditions (a) and (b), p∞(x) has two additional advantages over other trial densities. First, the similarity of the trial density to the target density can be controlled to some extent by the user. For example, instead of (11), we can sample from the following density:

    p∞(x) ∝ Σ_{i=1}^m [ψ(x)/(λi gi)] I(x ∈ Ei),   (12)

where the parameters λi, i = 1, ..., m, control the sampling frequency of the subregions. Second, resampling can be made online if we are interested in generating equally weighted samples from p(x). Let ωi = gi / max_{j=1,...,m} gj denote the resampling probability from the subregion Ei, and let xt denote the sample drawn from p∞(x) at iteration t.
312 Journal of the American Statistical Association, March 2007

5. USE OF STOCHASTIC APPROXIMATION MONTE CARLO IN MODEL SELECTION PROBLEMS

5.1 Algorithms

Suppose that we have a posterior distribution denoted by f(M, ϑM|D), where D denotes the data, M is the index of


Figure 3. Computational Results for the Mixture Gaussian Example. (a) and (b) Contour plots of the true and trial densities. The contour lines correspond to 99%, 95%, 50%, 5%, and 1% of the total mass. (c) The path of the first 500 samples simulated from p(x) by the SAMC importance-sampling algorithm.
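The importance-resampling scheme behind these samples can be sketched on a small discrete analogue. Everything below (the ten-state target, the two-subregion energy partition, and the g values taken as already estimated) is our own toy construction; it only illustrates steps 1-3 of the SAMC importance-resampling algorithm.

```python
import math, random

random.seed(3)

# Toy discrete analogue: states 0..9, bimodal unnormalized target p0, and an
# (assumed) energy partition into m = 2 subregions with known g values.
p0 = [math.exp(-((i - 2) ** 2) / 2.0) + math.exp(-((i - 7) ** 2) / 2.0)
      for i in range(10)]
region = [0 if p0[i] > 0.3 else 1 for i in range(10)]     # low/high energy
g = [sum(p0[i] for i in range(10) if region[i] == k) for k in (0, 1)]
omega = [gk / max(g) for gk in g]                          # resampling probs

def p_inf(i):                 # trial density as in (11): p0 reweighted by 1/g_k
    return p0[i] / g[region[i]]

x, samples = 0, []
for t in range(20000):
    y = (x + random.choice([-1, 1])) % 10                  # step 1: MH move
    if random.random() < min(1.0, p_inf(y) / p_inf(x)):
        x = y
    if random.random() < omega[region[x]]:                 # step 2: U < omega_k
        samples.append(x)                                  # saved as p sample
```

Because p_inf(x) multiplied by the resampling probability of its subregion is proportional to p0(x), the saved draws follow the original target: both modes (states 2 and 7) appear in roughly equal proportion.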

models, and ϑM is the vector of parameters associated with model M. Without loss of generality, we assume that only a finite number, m, of models are under consideration and that the models are subject to a uniform prior. The sample space of f(M, ϑM|D) can be written as X = ∪_{i=1}^m X_{Mi}, where X_{Mi} denotes the sample space of f(ϑ_{Mi}|Mi, D). If we let Ei = X_{Mi} for i = 1, ..., m, and ψ(·) ∝ f(M, ϑM|D), it follows from (7) that g_i^{(t)}/g_j^{(t)} = e^{θti − θtj} forms a consistent estimator for the Bayes factor of the models Mi and Mj, 1 ≤ i, j ≤ m. We note that reversible-jump MCMC (RJMCMC) (Green 1995) can also estimate the Bayes factors of m models simultaneously. For comparison, in what follows we give explicitly the iterative procedures of the two methods for Bayesian model selection.

Let Q(Mi → Mj) denote the proposal probability for a transition from model Mi to model Mj, and let T(ϑ_{Mi} → ϑ_{Mj}) denote the proposal distribution for generating ϑ_{Mj} conditional on ϑ_{Mi}. Assume that both Q and T satisfy the condition (4). Let M(t) and ϑ(t) denote the model and the model parameters sampled at iteration t. One iteration of SAMC comprises the following steps:

The SAMC model-selection algorithm.

1. Generate model M∗ according to the proposal matrix Q.
2. If M∗ = M(t), then simulate a sample ϑ∗ from f(ϑ_{M(t)}|M(t), D) by a single MCMC iteration and set (M(t+1), ϑ(t+1)) = (M∗, ϑ∗).
3. If M∗ ≠ M(t), then generate ϑ∗ according to the proposal distribution T and accept the sample (M∗, ϑ∗) with probability

       min{ 1, [e^{θt,M(t)} f(M∗, ϑ∗|D)] / [e^{θt,M∗} f(M(t), ϑ(t)|D)] × [Q(M∗ → M(t)) T(ϑ∗ → ϑ(t))] / [Q(M(t) → M∗) T(ϑ(t) → ϑ∗)] }.   (13)

   If it is accepted, then set (M(t+1), ϑ(t+1)) = (M∗, ϑ∗); otherwise, set (M(t+1), ϑ(t+1)) = (M(t), ϑ(t)).
4. Set θ∗ = θt + γt+1(et+1 − π), where et+1 = (et+1,1, ..., et+1,m) and et+1,i = 1 if M(t+1) = Mi and 0 otherwise. If θ∗ ∈ Θ, then set θt+1 = θ∗; otherwise, set θt+1 = θ∗ + c∗, where c∗ is chosen such that θ∗ + c∗ ∈ Θ.

Let ν_i^{(t)} = #{M(k) = Mi : k = 1, 2, ..., t} be the sampling frequency of model Mi during the first t iterations in a run of RJMCMC. With the same proposal matrix Q, the same proposal distribution T, and the same MH step (or Gibbs cycle) as those used by SAMC, one iteration of RJMCMC can be described as follows:

The RJMCMC algorithm.

1. Generate model M∗ according to the proposal matrix Q.
2. If M∗ = M(t), then simulate a sample ϑ∗ from f(ϑ_{M(t)}|M(t), D) by a single MCMC iteration and set (M(t+1), ϑ(t+1)) = (M∗, ϑ∗).
3. If M∗ ≠ M(t), then generate ϑ∗ according to the proposal density T and accept the sample (M∗, ϑ∗) with probability

       min{ 1, [f(M∗, ϑ∗|D)] / [f(M(t), ϑ(t)|D)] × [Q(M∗ → M(t)) T(ϑ∗ → ϑ(t))] / [Q(M(t) → M∗) T(ϑ(t) → ϑ∗)] }.   (14)

   If it is accepted, then set (M(t+1), ϑ(t+1)) = (M∗, ϑ∗); otherwise, set (M(t+1), ϑ(t+1)) = (M(t), ϑ(t)).
4. Set ν_i^{(t+1)} = ν_i^{(t)} + I(M(t+1) = Mi) for i = 1, ..., m.

The standard MCMC theory (Tierney 1994) implies that as t → ∞, ν_i^{(t)}/ν_j^{(t)} forms a consistent estimator for the ratio of the posterior probabilities of model Mi and model Mj.

We note that the form of the RJMCMC algorithm just described is not the most general one; the proposal distribution T(· → ·) is assumed to be such that the Jacobian term in (14) reduces to 1. This observation is also applicable to the SAMC model-selection algorithm. The MCMC algorithm used in step 2 of the foregoing two algorithms can be the MH algorithm, the Gibbs sampler (Geman and Geman 1984), or any other advanced MCMC algorithm, such as simulated tempering, parallel tempering, evolutionary Monte Carlo, or SAMC importance-resampling (discussed in Sec. 4). When the distribution f(M, ϑM|D) is complex, an advanced MCMC algorithm may be chosen and multiple iterations may be used in this step.

SAMC and RJMCMC differ only at steps 3 and 4, that is, in the manner of acceptance for a new sample and estimation of the model probabilities. In SAMC, a new sample is accepted with an adjusted probability. The adjustment always works in the reverse direction of the estimation error of the model probability or, equivalently, of the discrepancy between the realized sampling frequency and the desired one. Thus it guarantees convergence of the algorithm. In simulations, we can see that SAMC can overcome difficulties in dimension-jumping moves and provide a full exploration of all models. Recall that the proposal distributions have been assumed to satisfy the condition (4). Because RJMCMC does not have the self-adjusting ability, it samples each model with a frequency proportional to its probability. In simulations, we can see that RJMCMC often stays on a model for a long time if that model has a significantly higher probability than its neighboring models. In SAMC, the estimates of the model probabilities are updated on the logarithmic scale; this makes it possible for SAMC to work for a group of models with huge differences in probability. This is beyond the ability of RJMCMC, which can work only for a group of models with comparable probabilities. Finally, we point out that for a problem that contains only several models with comparable probabilities, SAMC may not be better than RJMCMC, because in this case its self-adjusting ability is no longer crucial for mixing of the models. SAMC is essentially an importance sampling method (i.e., the samples are not equally weighted); hence its efficiency should be lower than that of RJMCMC for a problem in which RJMCMC succeeds. In summary, we suggest using SAMC when the model space is complex, for example, when the distribution f(M|D) has well-separated multiple modes or when there are probability models that are tiny but of interest to us.

5.2 Numerical Results

The autologistic model (Besag 1974) has been widely used for spatial data analysis (see, e.g., Preisler 1993; Augustin, Mugglestone, and Buckland 1996). Let s = {si : i ∈ D} denote a configuration of the model in which the binary response si ∈ {−1, +1} is called a spin and D is the set of indices of the spins. Let |D| denote the total number of spins in D, and let N(i) denote the set of neighbors of spin i. The probability mass function of the model is

    p(s|α, β) = [1/ϕ(α, β)] exp{ α Σ_{i∈D} si + (β/2) Σ_{i∈D} si Σ_{j∈N(i)} sj },   (α, β) ∈ Θ,   (15)

where Θ is the parameter space and ϕ(α, β) is the normalizing constant defined by

    ϕ(α, β) = Σ_{all possible s} exp{ α Σ_{i∈D} si + (β/2) Σ_{i∈D} si Σ_{j∈N(i)} sj }.

The parameter α determines the overall proportion of si with a value of +1, and the parameter β determines the intensity of the interaction between si and its neighbors.

A major difficulty with this model is that the function ϕ(α, β) is generally unknown analytically. Evaluating ϕ(α, β) exactly is prohibitive even for a moderate system, because it requires a summation over all 2^|D| possible realizations of s. Because ϕ(α, β) is unknown, importance sampling is perhaps the most convenient technique if we aim at calculating the expectation Eα,β h(s) over the parameter space. This problem is a little different from the conventional importance sampling problems discussed in Section 4: there we have only one target distribution, whereas here we have multiple target distributions indexed by their parameter values. A natural choice for the trial distribution is a mixture distribution of the form

    p∗mix(s) = (1/m∗) Σ_{j=1}^{m∗} p(s|αj, βj),   (16)

where the values of the parameters (α1, β1), ..., (αm∗, βm∗) are prespecified.
Liang, Liu, and Carroll: Stochastic Approximation in Monte Carlo Computation 313
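The self-adjusting mechanism of the SAMC model-selection iteration can be sketched on a toy two-model problem of our own construction (parameters integrated out, symmetric model proposal, and gain factor γt = t0/max(t0, t), all assumptions for illustration). Even though the two models differ in posterior mass by a factor of 10^6, SAMC visits them nearly uniformly, and the θ difference recovers the log Bayes factor:

```python
import math, random

random.seed(7)

# Two "models" whose (unnormalized) posterior masses differ by 10^6; the
# model-specific parameters are integrated out in this toy construction.
f = [1.0, 1e-6]
m = 2
pi = [0.5, 0.5]                 # desired uniform sampling distribution
theta = [0.0, 0.0]
t0 = 100.0
M = 0
visits = [0, 0]

for t in range(1, 200001):
    M_star = 1 - M              # symmetric model proposal (Q cancels)
    # Acceptance probability in the spirit of (13), with T an identity map:
    ratio = math.exp(theta[M] - theta[M_star]) * f[M_star] / f[M]
    if random.random() < min(1.0, ratio):
        M = M_star
    visits[M] += 1
    # Step 4: push theta toward the desired sampling distribution pi.
    gamma = t0 / max(t0, t)
    for i in range(m):
        theta[i] += gamma * ((1.0 if i == M else 0.0) - pi[i])

log_bf = theta[0] - theta[1]    # converges to log(f[0]/f[1]) = log 1e6
```

An unadjusted sampler would spend about one iteration in 10^6 on the small model; here the adjustment term e^{θ} equalizes the visiting frequencies while θ itself stores the probability ratio.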
We note that this idea has been suggested


Figure 4. U.S. Cancer Mortality Rate Data. (a) The mortality map of liver and gallbladder cancer (including bile ducts) for white males during the decade 1950–1959. The black squares denote the counties of high cancer mortality rate, and the white squares denote the counties of low cancer mortality rate. (b) Fitted cancer mortality rates. The cancer mortality rate of each county is represented by the gray level of the corresponding square.

by Geyer (1996). To complete this idea, the key is to estimate ϕ(α1, β1), ..., ϕ(αm∗, βm∗). The estimation can be done up to a common multiplicative constant, which will be canceled out in the calculation of Eα,β h(s) in (9). Geyer (1996) also suggested stochastic approximation as a feasible method for simultaneously estimating ϕ(α1, β1), ..., ϕ(αm∗, βm∗), but gave no details. Several authors have based their inferences for a distribution like (15) on estimates of the normalizing constant function at a finite number of points. For example, Diggle and Gratton (1984) proposed estimating the normalizing constant function on a grid, smoothing the estimates using a kernel method, and then substituting the smoothed estimates into (15) as known for finding maximum likelihood estimators (MLEs) of the parameters. A similar idea was also proposed by Green and Richardson (2002) for analyzing a disease-mapping example.

In this article we explore the idea of Geyer (1996) and give details about how SAMC can be used to simultaneously estimate ϕ(α1, β1), ..., ϕ(αm∗, βm∗) and how the estimates can be further used in estimation of the model parameters. The dataset considered is the U.S. cancer mortality rate data shown in Figure 4(a). Following Sherman, Apanasovich, and Carroll (2006), we modeled the data by a spatial autologistic model. The total number of spins is |D| = 2,293. Suppose that the parameter points used in (16) form a 21 × 11 lattice (m∗ = 231), with α equally spaced between −.5 and .5 and β between 0 and .5. Because ϕ(α, β) is symmetric about α = 0, we only need to estimate it on a sublattice with α between 0 and .5. The sublattice consists of m = 121 points. Estimating the quantities ϕ(α1, β1), ..., ϕ(αm, βm) can be treated as a Bayesian model selection problem, although no observed data are involved. This is because ϕ(α1, β1), ..., ϕ(αm, βm) correspond to the normalizing constants of different distributions. In what follows, the SAMC model-selection algorithm and the RJMCMC algorithm were applied to this problem by treating each grid point (αj, βj) as a different model and p(s|αj, βj)ϕ(αj, βj) as the posterior distribution used in (13) and (14).

SAMC was first applied to this problem. The proposal matrix Q, the proposal distribution T, and the MCMC sampler used in step 2 are specified as follows. Let the m models be coded as a matrix (Mij) with i = 0, ..., 10 and j = 0, ..., 10. The proposal matrix Q is then defined as

    Q(Mij → Mi′j′) = q_{ii′}^{(α)} q_{jj′}^{(β)},

where q_{i,i−1}^{(α)} = q_{i,i}^{(α)} = q_{i,i+1}^{(α)} = 1/3 for i = 1, ..., 9, q_{0,0}^{(α)} = q_{10,10}^{(α)} = 2/3, and q_{0,1}^{(α)} = q_{10,9}^{(α)} = 1/3; and q_{i,i−1}^{(β)} = q_{i,i}^{(β)} = q_{i,i+1}^{(β)} = 1/3 for i = 1, ..., 9, q_{0,0}^{(β)} = q_{10,10}^{(β)} = 2/3, and q_{0,1}^{(β)} = q_{10,9}^{(β)} = 1/3. For this example, ϑ corresponds to the configuration s of the model. The proposal distribution T(ϑ(t) → ϑ∗) is an identity mapping; that is, the current configuration is kept unchanged when a model is proposed to be changed to another one. Thus we have T(ϑ(t) → ϑ∗) = T(ϑ∗ → ϑ(t)) = 1. The MCMC sampler used in step 2 is the Gibbs sampler (Geman and Geman 1984), sampling spin i from the conditional distribution

    P(si = +1|N(i)) = 1 / [1 + e^{−2(α + β Σ_{j∈N(i)} sj)}],
    P(si = −1|N(i)) = 1 − P(si = +1|N(i)),   (17)

for all i ∈ D in a prespecified order.

SAMC was run five times independently. Each run consisted of two stages. The first stage estimated the function ϕ(α, β) on the sublattice; in this stage, SAMC was run with t0 = 10^4 and for 10^8 iterations. The second stage drew importance samples from the trial distribution

    p∗mix(s) ∝ (1/m∗) Σ_{k=1}^{m∗} [1/ϕ̂(αk, βk)] exp{ αk Σ_{i∈D} si + (βk/2) Σ_{i∈D} si Σ_{j∈N(i)} sj },   (18)

which represents an approximation to (16), with ϕ(αj, βj) replaced by its estimate ϕ̂(αj, βj) obtained in the first stage. In this stage SAMC was run with δt ≡ 0 for 10^7 iterations, and a total of 10^5 samples were harvested at equally spaced time points. Each run cost about 115 minutes of CPU time on a 2.8-GHz computer. In the second stage, SAMC is reduced to RJMCMC by setting δt = 0. Figure 5(a) shows one estimate of ϕ(α, β) obtained in a run of SAMC.

Using the importance samples collected earlier, we estimated the probability P(si = +1|α, β), which is a function of (α, β). The estimation can be done in (9) by setting h(s) = Σ_{i∈D} (si + 1)/(2|D|). By averaging over the five runs, we obtained one estimate of the function, as shown in Figure 5(b). To assess the variation of the estimate, we calculated the standard deviation of the estimate at each grid point of the lattice. The average of the standard deviations is 3 × 10^−4; the estimate is fairly stable.

Using the importance samples collected earlier, we also estimated the parameters (α, β) for the cancer data shown in Figure 4(a). The estimation can be done using the Monte Carlo maximum likelihood method (Geyer and Thompson 1992; Geyer 1994) as follows. Let p∗(s) = c∗ p∗0(s) denote an arbitrary trial distribution for (15), where p∗0(s) is completely specified and c∗ is an unknown constant. Let ψ(α, β, s) = ϕ(α, β)p(s|α, β), and let L(α, β|s) denote the log-likelihood function of an observation s. Then

    Ln(α, β|s) = α Σ_{i∈D} si + (β/2) Σ_{i∈D} si Σ_{j∈N(i)} sj + log c∗ − log{ (1/n) Σ_{k=1}^n ψ(α, β, s(k)) / p∗0(s(k)) }   (19)

approaches L(α, β|s) as n → ∞, where s(1), ..., s(n) are MCMC samples simulated from p∗(s). The estimate (α̂, β̂) = arg max_{α,β} Ln(α, β|s) is called the Monte Carlo MLE (MCMLE) of (α, β). The maximization can be done using a conventional optimization procedure, say the conjugate gradient method. Setting p∗(s) = p∗mix(s), the five runs of SAMC resulted in five estimates of (α, β). The mean and standard deviation vectors of these estimates were (−.2994, .1237) and (.00063, .00027). Henceforth, these estimates are called mix-MCMLEs, because they are obtained based on a mixture trial distribution. Figure 4(b) shows the fitted mortality map based on one mix-MCMLE, (−.2999, .1234).

In the literature, p∗(s) is often constructed based on a single parameter point, that is, setting

    p∗(s) ∝ exp{ α∗ Σ_{i∈D} si + (β∗/2) Σ_{i∈D} si Σ_{j∈N(i)} sj },   (20)
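A minimal sketch of the Gibbs update (17) on a small toy lattice. The 20 × 20 torus neighborhood and the exact lattice size are our own illustrative choices (the article's D is the 2,293-county adjacency structure); the parameter values are taken near the mix-MCMLE reported in the text.

```python
import math, random

random.seed(11)

L = 20                           # toy L x L lattice (assumption)
alpha, beta = -0.3, 0.12         # near the mix-MCMLE reported in the text

def neighbors(i, j):
    # 4-nearest neighbors on a torus (an assumed boundary convention).
    return [((i + 1) % L, j), ((i - 1) % L, j), (i, (j + 1) % L), (i, (j - 1) % L)]

s = [[random.choice([-1, 1]) for _ in range(L)] for _ in range(L)]

def gibbs_sweep():
    # One cycle: visit every spin in a fixed order and draw it from (17).
    for i in range(L):
        for j in range(L):
            field = sum(s[a][b] for a, b in neighbors(i, j))
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * (alpha + beta * field)))
            s[i][j] = 1 if random.random() < p_plus else -1

for _ in range(200):
    gibbs_sweep()

frac_plus = sum(x == 1 for row in s for x in row) / (L * L)
```

With α < 0 the sampler settles into a configuration with well under half of the spins equal to +1, mirroring how the fitted map in Figure 4(b) is driven by the sign of α.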


Figure 5. Computational Results of SAMC. (a) Estimate of log ϕ(α, β) on a 21 × 11 lattice with α ∈ {−.5, −.45, ..., .5} and β ∈ {0, .05, ..., .5}. (b) Estimate of P(si = +1|α, β) on a 50 × 25 lattice with α ∈ {−.49, −.47, ..., .49} and β ∈ {.01, .03, ..., .49}.

where (α∗, β∗) denotes the parameter point. The point (α∗, β∗) should be chosen to be close to the true parameter point; otherwise, a large value of n would be required for the convergence of (19). Sherman et al. (2006) set (α∗, β∗) to be the maximum pseudolikelihood estimate (Besag 1975) of (α, β), which is the MLE of the pseudolikelihood function

    PL(α, β|s) = Π_{i∈D} exp{si(α + β Σ_{j∈N(i)} sj)} / [exp{α + β Σ_{j∈N(i)} sj} + exp{−α − β Σ_{j∈N(i)} sj}].   (21)

We repeated the procedure of Sherman et al. for the cancer data five times, with n = 10^5 and the MCMC samples collected at equally spaced time points in a run of the Gibbs sampler of 10^7 iteration cycles. The mean and standard deviation vectors of the resulting estimates are (−.3073, .1262) and (.00837, .00946). These estimates have a significantly higher variation than the mix-MCMLEs. Henceforth, these estimates are called single-MCMLEs, because they are obtained based on a single-point trial distribution.

To compare the accuracy of the mix-MCMLEs and single-MCMLEs, we conducted the following experiment based on the principle of the parametric bootstrap method (Efron and Tibshirani 1993). Let T1 = Σ_{i∈D} si and T2 = (1/2) Σ_{i∈D} si (Σ_{j∈N(i)} sj). It is easy to see that T = (T1, T2) forms a sufficient statistic for (α, β). Given an estimate (α̂, β̂), we can reversely estimate the quantities T1 and T2 by drawing samples from the distribution f(s|α̂, β̂). If (α̂, β̂) is accurate, then we should have tobs ≈ tsim, where tobs and tsim denote the values of T calculated from the true observation and from the simulated samples. To calculate tsim, we generated 1,000 independent configurations conditional on each estimate, with each configuration generated by a short run of the Gibbs sampler. The Gibbs sampler started with a random configuration and was iterated for 1,000 cycles. A convergence diagnostic shows that 1,000 iteration cycles were long enough for the Gibbs sampler to reach equilibrium for simulation of f(s|α̂, β̂).

Table 2. Comparison of the Accuracy of the Mix-MCMLEs and Single-MCMLEs for the U.S. Cancer Data

    Estimate            Single-MCMLE    Mix-MCMLE
    RMSE(t1^sim)        59.51           2.90
    RMSE(t2^sim)        114.91          4.61

    NOTE: RMSE(ti^sim) is calculated as sqrt[ Σ_{k=1}^5 (ti^{sim,k} − ti^{obs})^2 / 5 ], where i = 1, 2, and ti^{sim,k} denotes the value of ti^sim calculated based on the kth estimate of (α, β).

Table 2 compares the root mean squared errors (RMSEs) of the tsim's calculated from the mix-MCMLEs and the single-MCMLEs. The comparison shows that the mix-MCMLEs are much more accurate than the single-MCMLEs for this example.

For comparison, RJMCMC was also run for this example for 10^8 iterations. The simulation started with model M0,0, moved to model M10,10 very fast, and then got stuck there.
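To make (21) concrete, the following sketch evaluates the log pseudolikelihood on a small hand-made configuration. The lattice size, free-boundary convention, and parameter values are our own toy choices; writing each conditional factor as exp(si z)/(2 cosh z), with z = α + β Σ_{j∈N(i)} sj, gives the log term used below.

```python
import math

# Toy 8 x 8 configuration (our own): mostly +1 spins with a -1 corner block.
L = 8
s = [[1] * L for _ in range(L)]
for i in range(3):
    for j in range(3):
        s[i][j] = -1

def neighbors(i, j):
    # 4-nearest neighbors with free boundaries (an assumed convention).
    out = []
    if i > 0: out.append((i - 1, j))
    if i < L - 1: out.append((i + 1, j))
    if j > 0: out.append((i, j - 1))
    if j < L - 1: out.append((i, j + 1))
    return out

def log_pl(alpha, beta):
    # Log pseudolikelihood (21): sum over spins of the log full conditional
    # exp(s_i * z_i) / (2 cosh z_i), with z_i = alpha + beta * field_i.
    total = 0.0
    for i in range(L):
        for j in range(L):
            field = sum(s[a][b] for a, b in neighbors(i, j))
            z = alpha + beta * field
            total += s[i][j] * z - math.log(2.0 * math.cosh(z))
    return total
```

On this mostly-positive, clustered configuration, the pseudolikelihood prefers α > 0 to α < 0 and β > 0 to β < 0, which is exactly the kind of comparison a maximum pseudolikelihood search exploits.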



Figure 6. Comparison of SAMC and RJMCMC. (a) and (b) The sample paths of α and β in a run of SAMC. (c) and (d) The sample paths of α and β in a run of RJMCMC.

This is shown in Figures 6(c) and 6(d), where the parameter vector (α, β) = (0, 0) corresponds to model M0,0 and (.5, .5) corresponds to model M10,10. RJMCMC failed to estimate ϕ(α1, β1), ..., ϕ(αm, βm) simultaneously. This phenomenon can be easily understood from Figure 5(a), which shows that model M10,10 has a dominant probability over the other models. Five runs of SAMC produced an estimate of the log-probability ratio log P(M10,10)/P(M0,0); the estimate is 1,775.7, with standard deviation .6. Making transitions between models with such a huge difference in probability is beyond the ability of RJMCMC. It is also beyond the ability of other advanced MCMC samplers, such as simulated tempering, parallel tempering, and evolutionary Monte Carlo, because the strength of these advanced MCMC samplers lies in making transitions between different modes of the distribution rather than in sampling from low-probability models. However, it is not difficult for SAMC, due to its ability to sample rare events from a large sample space. For comparison, Figures 6(a) and 6(b) plot the sample paths of α and β obtained in a run of SAMC. This figure indicates that even though the models have very large differences in probabilities, SAMC can still mix them well and sample each model equally. Note that the desired sampling distribution has been set to the uniform distribution for this example and the other examples of this section.

6. DISCUSSION

In this article we have introduced the SAMC algorithm and studied its convergence. SAMC overcomes the shortcomings of the WL algorithm: it can improve its estimates continuously as the simulation proceeds. Two classes of applications, importance sampling and model selection, are discussed. SAMC can work as a general importance sampling method and as a model selection sampler when the model space is complex.

As with many other Monte Carlo algorithms, such as slice sampling (Neal 2003), SAMC also suffers from the curse of dimensionality. For example, consider the modified witch's hat distribution studied by Geyer and Thompson (1995),

    p(x) = { 1 + β,  x ∈ [0, α]^k,
             1,      x ∈ [0, 1]^k \ [0, α]^k,   (22)

where k = 30 is the dimension of x, α = 1/3, and β ≈ 10^14, which is chosen such that the probability of the peak is exactly 1/3. For this distribution, the small hypercube is called the peak, and the rest is called the brim. It is easy to see that SAMC is not better than MH for sampling from this distribution if the sample space is partitioned according to the energy function. The peak is like an atom, so SAMC will make a random walk in the brim just like MH. The likelihood of SAMC jumping into the peak from the brim decreases geometrically as the dimension increases. One way to overcome this difficulty is to include an auxiliary variable in (22) and to work on the joint distribution

    p(xI, I) = { 1 + βI,  xI ∈ [0, α]^I,
                 1,       xI ∈ [0, 1]^I \ [0, α]^I,   (23)

where I ∈ {1, ..., 30} is the dimension of xI and βI is chosen such that the peak probability is exactly 1/3. To sample from (23), we can make a joint partition on I and energy. Let E11, E12, ..., Ek1, Ek2 denote the partition, where Ei1 and Ei2 denote the peak and brim sets of p(xi, i). SAMC can then work on the distribution with this partition and appropriate proposal distributions (dimension jumping will be involved). As shown by Liang (2003), working on such a sequence of trial distributions indexed by dimension can help the sampler reduce the curse of dimensionality. We note that the auxiliary variable used in constructing the joint distribution is not necessarily the dimension variable; the temperature variable can be used, as in simulated tempering, for some problems for which a dimension change is not sensible.

In our theoretical results on convergence, we assume that the sample space X and the parameter space Θ are both compact. At least in principle, these restrictions can be removed, as was done by Andrieu et al. (2005). If the restrictions are removed, then we may need to put some other constraints on the tails of the target distribution p(x) and the proposal distribution q(x, y) to ensure that the minorization condition holds (see Roberts and Tweedie 1996; Rosenthal 1995; Roberts and Rosenthal 2004 for more discussion of the issue). Our numerical experience indicates that SAMC should have some type of convergence even when the minorization condition does not hold, in a manner similar to the MH algorithm (Mengersen and Tweedie 1996). Further study in this direction is of some interest.

APPENDIX: THEORETICAL RESULTS ON STOCHASTIC APPROXIMATION MONTE CARLO

The appendix is organized as follows. In Section A.1 we describe a theorem for the convergence of the SAMC algorithm. In Section A.2 we briefly review published results on the convergence of a general stochastic approximation algorithm. In Section A.3 we give a proof for the theorem described in Section A.1.

A.1 A Convergence Theorem for SAMC

Without loss of generality, we show only the convergence presented in (7) for the case where all subregions are nonempty or, equivalently, d = 0. Extension to the case d ≠ 0 is trivial, because changing step (2) of the SAMC algorithm to (2′) will not change the process of simulation:

    (2′) Set θ∗ = θt + γt(et − π − d), where d is the m-vector with all elements equal to d.

Theorem A.1. Let E1, ..., Em be a partition of a compact sample space X, and let ψ(x) be a nonnegative function defined on X with 0 < ∫_{Ei} ψ(x) dx < ∞ for all the Ei's. Let π = (π1, ..., πm) be an m-vector with 0 < πi < 1 and Σ_{i=1}^m πi = 1. Let Θ be a compact set of m dimensions, and let there exist a constant C̆ such that θ̆ ∈ Θ, where θ̆ = (θ̆1, ..., θ̆m) and θ̆i = C̆ + log(∫_{Ei} ψ(x) dx) − log(πi). Let θ0 ∈ Θ be an initial estimate of θ̆, and let θt ∈ Θ be the estimate of θ at iteration t. Let {γt} be a nonincreasing, positive sequence as specified in (6). Suppose that p_{θt}(x) is bounded away from 0 and ∞ on X and that the proposal distribution satisfies the condition (4). As t → ∞, we have

    P{ lim_{t→∞} θti = C + log( ∫_{Ei} ψ(x) dx ) − log(πi) } = 1,   i = 1, ..., m,   (A.1)

where C is an arbitrary constant.

A.2 Existing Results on the Convergence of a General Stochastic Approximation Algorithm

Suppose that our goal is to solve the following integration equation for the parameter vector θ:

    h(θ) = ∫_X H(θ, x) p(dx) = 0,   θ ∈ Θ.
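A toy sketch of solving such an equation by stochastic approximation driven by an MCMC kernel. All specifics below are our own illustrative choices: H(θ, x) = x − θ, so the root of h is the mean of p, here a N(2, 1) density known only up to a constant.

```python
import math, random

random.seed(5)

# Toy root-finding: solve h(theta) = E_p[x - theta] = 0, where p is N(2, 1)
# known only up to a constant; H and the step sizes are our own choices.
def log_p(x):
    return -0.5 * (x - 2.0) ** 2

theta, x = 0.0, 0.0
t0 = 50.0
for t in range(1, 100001):
    # Draw y from an MH kernel targeting p (here the kernel happens not to
    # depend on theta).
    y = x + random.gauss(0.0, 1.0)
    if math.log(random.random() + 1e-300) < log_p(y) - log_p(x):
        x = y
    # Stochastic approximation update with H(theta, x) = x - theta.
    gamma = t0 / max(t0, t)
    theta += gamma * (x - theta)
```

With a step-size sequence satisfying the usual divergence/summability conditions, theta settles at the root of h, here the target mean 2, even though the "observations" x are correlated MCMC draws rather than independent samples.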

The stochastic approximation algorithm with MCMC innovations where c = log(S) can be determined by imposing a constraint on S.For (noise) works iteratively as follows. Let K(xt, ·) beaMCMCtransi- example, setting S = 1 leads to c = 0. It is obvious that L is nonempty tion kernel, for example, the MH kernel of the form and that w(θ) = 0foreveryθ ∈ L. To verify the conditions (A1-a), (A1-c), and (A1-d), we have the K(xt, dy) = s(xt, dy) + I(xt ∈ dy) 1 − s(xt, z) dz , following calculations: X where s(xt, dy) = q(xt, dy) min{1, [p(y)q(y, xt)]/[p(x)q(x, y)]}, q(·, ·) ∂S ∂Si = =−Si, is the proposal distribution, and p(·) is the invariant distribution. Let ∂θi ∂θi ˜ ˜ ∞  ⊂  be a compact subset of .Let{γt} be a monotone nonin- t=0 ∂Si ∂Sj creasing sequence governing the step size. In addition, define a func- = = 0, X × ˜ → X × ∂θj ∂θi tion  :  , which reinitializes the nonhomogeneous (A.3) Markov chain {(x ,θ )}. For instance, the function  can generate a ∂(S /S) S S t t i =− i 1 − i , and random or fixed point, or project (xt+1,θt+1) onto X × . An itera- ∂θi S S tion of the algorithm is as follows: ∂(S /S) S S ∂(Si/S) = j = i j ∼ · 1. Generate y Kθ (xt, ). ∂θj ∂θi S2 ∗ t 2. Set θ = θt + γt+1H(θt, y). ∗ ∗ 3. If θ ∈ ,thenset(xt+1,θt+1) = (y,θ ); otherwise, set for i, j = 1,...,m and i = j,and ∗ (xt+1,θt+1) = (y,θ ). m ∂w(θ) 1 ∂(S /S − π )2 This algorithm is actually a simplified version of the algorithm = k k P ∂θi 2 ∂θi given by Andrieu et al. (2005). Let x0,θ0 denote the probability mea- k=1 sure of the Markov chain {(xt,θt)}, started in (x0,θ0), and implicitly Sj SiSj Si Si Si defined by the sequences {γt}.DefineD(x, A) = inf ∈ |x − y|. = − π − − π 1 − y A j 2 i = S S S S S Theorem A.2 (Thm. 5.5 and prop. 6.1 of Andrieu et al. 2005). As- j i sume that the conditions (A ) and (A ) hold, and there exists a drift m 1 4 Sj SiSj Si Si function V(x) such that sup V(x)<∞ and the drift condition = − πj − − πi x∈X S S2 S S holds (see Andrieu et al. 
2005 for descriptions of the conditions). Let j=1 the sequence {θn} be defined as in the stochastic approximation algo- Si Si Si rithm. Then for all (x ,θ ) ∈ X × , = µη − − π (A.4) 0 0 S S i S lim D(θt, L) = 0, Px ,θ -a.e. t→∞ 0 0 m Sj Sj for i = 1,...,m, where it is defined as µη = = ( − πj) . Thus A.3 Proof of Theorem A.2 j 1 S S ∇  To prove Theorem A.2, it suffices to verify that (A1), (A4),andthe w(θ), h(θ) drift condition hold for the SAMC algorithm. To simplify notation, m m 2 = Si Si Si Si in the proof we drop the subscript t by denoting xt by x and θt = µη − πi − − πi = S S S S (θt1,...,θtm) by θ (θ1,...,θm). Because the invariant distribution i=1 i=1 of the MH kernel is pθ (x),foranyfixedθ,wehave m 2 =− Si − Si − 2 (i) (i) πi µη E ex − πi = ex − πi pθ (x) dx S S X i=1 θ =− 2 ≤ ψ(x) dx/e i ση 0, (A.5) Ei = − πi m [ ψ(x) dx/eθk ] k=1 Ek 2 where ση denotes the variance of the discrete distribution defined by Si the following table: = − π , i = 1,...,m, (A.2) S i θi m where Si = ψ(x) dx/e and S = = Sk. Thus S1 Sm Ei k 1 State (η) − π1 ··· − πm S S  Probability S1 ··· Sm S1 Sm S S h(θ) = H(θ, x)p(dx) = − π1,..., − πm . X S S If θ ∈ L,then∇w(θ), h(θ)=0; otherwise, ∇w(θ), h(θ) < 0. Condition A . It follows from (A.2) that h(θ) is a continuous func- 1 For any M0 ∈ (0, 1],itistruethatL ⊂{θ ∈ ,w(θ) < M0}. Hence = 1 m Sk − 2 tion of θ.Letw(θ) 2 k=1( S πk) . As shown later, w(θ) has condition (A1-a) is satisfied. ≤ ≤ continuous partial derivatives of the first order. Because 0 w(θ) It follows from (A.5) that ∇w(θ), h(θ)≤0forallθ ∈ .The 1 [ m Sk 2 + 2 ]≤ ∈ L 2 k=1( S ) πk ) 1forallθ ,and itself is compact, the w( ) forms a line in space , because it contains only one free para- W ={ ∈ ≤ } level set M θ ,w(θ) M is compact for any positive inte- meter c. Therefore, the interior set of w(L) is empty. Conditions (A1-c) ger M. Condition (A1-b) is satisfied. and (A1-d) are satisfied. Solving the system of equations formed by (A.2), we have = = Condition A4. 
Condition A4. Let p be arbitrarily large, β = 1, α = 1, and ζ ∈ (1, 2). Thus the conditions Σ_{t=1}^∞ γ_t = ∞ and Σ_{t=1}^∞ γ_t^ζ < ∞ hold. Because |H(θ, x)| is bounded above by c_1, as shown in (A.8), |γ_t H(θ_{t−1}, x_t)| < c_1 γ_t holds. Condition (A4) is satisfied by choosing C = c_1 and η ∈ [(ζ − 1)/α, (p − ζ)/p] = [ζ − 1, 1).

Drift Condition. Theorem 2.2 of Roberts and Tweedie (1996) shows that if the target distribution is bounded away from 0 and ∞ on every compact set of its support X, then the MH chain with a proposal distribution satisfying the condition (4) is irreducible and aperiodic, and every nonempty compact set is small. Hence K_θ, the MH kernel used in each iteration of SAMC, is irreducible and aperiodic for any θ ∈ Θ. Because X is compact, X is a small set, and thus the minorization condition is satisfied; that is, there exists an integer l such that

inf_{θ∈Θ} K_θ^l(x, A) ≥ δν(A), ∀x ∈ X, ∀A ∈ B.   (A.6)

Define K_θ V(x) = ∫_X K_θ(x, y)V(y) dy. Because C = X is small, the following conditions hold:

sup_{θ∈Θ_0} K_θ^l V^p(x) ≤ λV^p(x) + bI(x ∈ C), ∀x ∈ X;
sup_{θ∈Θ_0} K_θ V^p(x) ≤ κV^p(x), ∀x ∈ X,   (A.7)

by choosing the drift function V(x) = 1, Θ_0 = Θ, 0 < λ < 1, b = 1 − λ, κ > 1, p ∈ [2, ∞), and any integer l. Equations (A.6) and (A.7) imply that (DRI1) is satisfied.

Let H^{(i)}(θ, x) be the ith component of the vector H(θ, x) = e_x − π. By construction, |H^{(i)}(θ, x)| = |e_x^{(i)} − π_i| < 1 for all x ∈ X and i = 1, …, m. Therefore, there exists a constant c_1 = √m such that for all x ∈ X,

sup_{θ∈Θ} |H(θ, x)| ≤ c_1.   (A.8)

In addition, H(θ, x) does not depend on θ for a given sample x. Hence H(θ, x) − H(θ′, x) = 0 for all (θ, θ′) ∈ Θ × Θ, and the following condition holds for the SAMC algorithm:

sup_{(θ,θ′)∈Θ×Θ} |H(θ, x) − H(θ′, x)| ≤ c_1|θ − θ′|.   (A.9)

Equations (A.8) and (A.9) imply that (DRI2) is satisfied by choosing β = 1 and V(x) = 1.

Let s_θ(x, y) = q(x, y) min{1, r(θ, x, y)}, where r(θ, x, y) = p_θ(y)q(y, x)/[p_θ(x)q(x, y)]. Thus we have

|∂s_θ(x, y)/∂θ_i| = q(x, y) I(r(θ, x, y) < 1) I(J(x) = i or J(y) = i) I(J(x) ≠ J(y)) r(θ, x, y) ≤ q(x, y),

where I(·) is the indicator function and J(x) denotes the index of the subregion to which x belongs. The mean-value theorem implies that there exists a constant c_2 such that

|s_θ(x, y) − s_{θ′}(x, y)| ≤ q(x, y) c_2 |θ − θ′|,   (A.10)

which implies that

sup_x ‖s_θ(x, ·) − s_{θ′}(x, ·)‖_1 = sup_x ∫_X |s_θ(x, y) − s_{θ′}(x, y)| dy ≤ c_2|θ − θ′|.   (A.11)

In addition, for any measurable set A ⊂ X, we have

|K_θ(x, A) − K_{θ′}(x, A)| = |∫_A [s_θ(x, y) − s_{θ′}(x, y)] dy + I(x ∈ A) ∫_X [s_{θ′}(x, z) − s_θ(x, z)] dz|
 ≤ ∫_X |s_θ(x, y) − s_{θ′}(x, y)| dy + I(x ∈ A) ∫_X |s_θ(x, z) − s_{θ′}(x, z)| dz
 ≤ 2 ∫_X |s_θ(x, y) − s_{θ′}(x, y)| dy ≤ 2c_2|θ − θ′|.   (A.12)

For g : X → R^d, define the norm ‖g‖_V = sup_{x∈X} |g(x)|/V(x). Then, for any function g ∈ L_V = {g : X → R^d, ‖g‖_V < ∞}, we have

‖K_θ g − K_{θ′} g‖_V = ‖∫_X [K_θ(x, dy) − K_{θ′}(x, dy)] g(y)‖_V
 = ‖∫_{X⁺} [K_θ(x, dy) − K_{θ′}(x, dy)] g(y) + ∫_{X⁻} [K_θ(x, dy) − K_{θ′}(x, dy)] g(y)‖_V
 ≤ max{‖∫_{X⁺} [K_θ(x, dy) − K_{θ′}(x, dy)] g(y)‖_V, ‖∫_{X⁻} [K_θ(x, dy) − K_{θ′}(x, dy)] g(y)‖_V}
 ≤ ‖g‖_V max{|K_θ(x, X⁺) − K_{θ′}(x, X⁺)|, |K_θ(x, X⁻) − K_{θ′}(x, X⁻)|}
 ≤ 2c_2 ‖g‖_V |θ − θ′|  [following from (A.12)],

where X⁺ = {y : y ∈ X, (K_θ(x, dy) − K_{θ′}(x, dy))g(y) > 0} and X⁻ = X \ X⁺. This implies that condition (DRI3) is satisfied by choosing V(x) = 1 and β = 1. The proof is completed.

[Received June 2005. Revised July 2006.]

Journal of the American Statistical Association, March 2007

REFERENCES

Andrieu, C., Moulines, É., and Priouret, P. (2005), "Stability of Stochastic Approximation Under Verifiable Conditions," SIAM Journal of Control and Optimization, 44, 283–312.
Augustin, N., Mugglestone, M., and Buckland, S. (1996), "An Autologistic Model for Spatial Distribution of Wildlife," Journal of Applied Ecology, 33, 339–347.
Besag, J. E. (1974), "Spatial Interaction and the Statistical Analysis of Lattice Systems" (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 192–236.
——— (1975), "Statistical Analysis of Non-Lattice Data," The Statistician, 24, 179–195.
Benveniste, A., Métivier, M., and Priouret, P. (1990), Adaptive Algorithms and Stochastic Approximations, New York: Springer-Verlag.
Berg, B. A., and Neuhaus, T. (1991), "Multicanonical Algorithms for 1st-Order Phase-Transitions," Physics Letters B, 267, 249–253.
Cappé, O., Guillin, A., Marin, J. M., and Robert, C. P. (2004), "Population Monte Carlo," Journal of Computational and Graphical Statistics, 13, 907–929.
Delyon, B., Lavielle, M., and Moulines, E. (1999), "Convergence of a Stochastic Approximation Version of the EM Algorithm," The Annals of Statistics, 27, 94–128.
Diggle, P. J., and Gratton, R. J. (1984), "Monte Carlo Methods of Inference for Implicit Statistical Models" (with discussion), Journal of the Royal Statistical Society, Ser. B, 46, 193–227.
Efron, B., and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.
Gelfand, A. E., and Banerjee, S. (1998), "Computing Marginal Posterior Modes Using Stochastic Approximation," technical report, University of Connecticut, Dept. of Statistics.
Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457–472.
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Geyer, C. J. (1991), "Markov Chain Monte Carlo Maximum Likelihood," in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, ed. E. M. Keramigas, Fairfax, VA: Interface Foundation, pp. 156–163.

Geyer, C. J. (1992), "Practical Markov Chain Monte Carlo" (with discussion), Statistical Science, 7, 473–511.
——— (1994), "On the Convergence of Monte Carlo Maximum Likelihood Calculations," Journal of the Royal Statistical Society, Ser. B, 56, 261–274.
——— (1996), "Estimation and Optimization of Functions," in Markov Chain Monte Carlo in Practice, eds. W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, London: Chapman & Hall, pp. 241–258.
Geyer, C. J., and Thompson, E. A. (1992), "Constrained Monte Carlo Maximum Likelihood for Dependent Data," Journal of the Royal Statistical Society, Ser. B, 54, 657–699.
——— (1995), "Annealing Markov Chain Monte Carlo With Applications to Ancestral Inference," Journal of the American Statistical Association, 90, 909–920.
Gilks, W. R., Roberts, G. O., and Sahu, S. K. (1998), "Adaptive Markov Chain Monte Carlo Through Regeneration," Journal of the American Statistical Association, 93, 1045–1054.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Green, P. J., and Richardson, S. (2002), "Hidden Markov Models and Disease Mapping," Journal of the American Statistical Association, 97, 1055–1070.
Gu, M. G., and Kong, F. H. (1998), "A Stochastic Approximation Algorithm With Markov Chain Monte Carlo Method for Incomplete Data Estimation Problems," Proceedings of the National Academy of Sciences USA, 95, 7270–7274.
Gu, M. G., and Zhu, H. T. (2001), "Maximum Likelihood Estimation for Spatial Models by Markov Chain Monte Carlo Stochastic Approximation," Journal of the Royal Statistical Society, Ser. B, 63, 339–355.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Hesselbo, B., and Stinchcombe, R. B. (1995), "Monte Carlo Simulation and Global Optimization Without Parameters," Physical Review Letters, 74, 2151–2155.
Hesterberg, T. (1995), "Weighted Average Importance Sampling and Defensive Mixture Distributions," Technometrics, 37, 185–194.
Honeycutt, J. D., and Thirumalai, D. (1990), "Metastability of the Folded States of Globular Proteins," Proceedings of the National Academy of Sciences USA, 87, 3526–3529.
Hukushima, K., and Nemoto, K. (1996), "Exchange Monte Carlo Method and Application to Spin Glass Simulations," Journal of the Physical Society of Japan, 65, 1604–1608.
Jobson, J. D. (1992), Applied Multivariate Data Analysis, Vol. II: Categorical and Multivariate Methods, New York: Springer-Verlag.
Lai, T. L. (2003), "Stochastic Approximation," The Annals of Statistics, 31, 391–406.
Liang, F. (2002), "Dynamically Weighted Importance Sampling in Monte Carlo Computation," Journal of the American Statistical Association, 97, 807–821.
——— (2003), "Use of Sequential Structure in Simulation From High-Dimensional Systems," Physical Review E, 67, 56101–56107.
——— (2004), "Annealing Contour Monte Carlo for Structure Optimization in an Off-Lattice Protein Model," Journal of Chemical Physics, 120, 6756–6763.
——— (2005), "Generalized Wang–Landau Algorithm for Monte Carlo Computation," Journal of the American Statistical Association, 100, 1311–1327.
Liang, F., and Wong, W. H. (2001), "Real Parameter Evolutionary Monte Carlo With Applications in Bayesian Mixture Models," Journal of the American Statistical Association, 96, 653–666.
Liu, J. S. (2001), Monte Carlo Strategies in Scientific Computing, New York: Springer-Verlag.
Liu, J. S., Liang, F., and Wong, W. H. (2001), "A Theory for Dynamic Weighting in Monte Carlo Computation," Journal of the American Statistical Association, 96, 561–573.
Marinari, E., and Parisi, G. (1992), "Simulated Tempering: A New Monte Carlo Scheme," Europhysics Letters, 19, 451–458.
Mengersen, K. L., and Tweedie, R. L. (1996), "Rates of Convergence of the Hastings and Metropolis Algorithms," The Annals of Statistics, 24, 101–121.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087–1091.
Moyeed, R. A., and Baddeley, A. J. (1991), "Stochastic Approximation of the MLE for a Spatial Point Pattern," Scandinavian Journal of Statistics, 18, 39–50.
Neal, R. M. (2003), "Slice Sampling" (with discussion), The Annals of Statistics, 31, 705–767.
Nevel'son, M. B., and Has'minskiĭ, R. Z. (1973), Stochastic Approximation and Recursive Estimation, Providence, RI: American Mathematical Society.
Preisler, H. K. (1993), "Modeling Spatial Patterns of Trees Attacked by Bark Beetles," Applied Statistics, 42, 501–514.
Robbins, H., and Monro, S. (1951), "A Stochastic Approximation Method," The Annals of Mathematical Statistics, 22, 400–407.
Robert, C. P., and Casella, G. (2004), Monte Carlo Statistical Methods (2nd ed.), New York: Springer-Verlag.
Roberts, G. O., and Rosenthal, J. S. (2004), "General State-Space Markov Chains and MCMC Algorithms," Probability Surveys, 1, 20–71.
Roberts, G. O., and Tweedie, R. L. (1996), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Biometrika, 83, 95–110.
Rosenthal, J. S. (1995), "Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo," Journal of the American Statistical Association, 90, 558–566.
Sherman, M., Apanasovich, T. V., and Carroll, R. J. (2006), "On Estimation in Binary Autologistic Spatial Models," Journal of Statistical Computation and Simulation, 76, 167–179.
Stavropoulos, P., and Titterington, D. M. (2001), "Improved Particle Filters and Smoothing," in Sequential Monte Carlo Methods in Practice, eds. A. Doucet, N. de Freitas, and N. Gordon, New York: Springer-Verlag, pp. 295–318.
Swendsen, R. H., and Wang, J. S. (1987), "Nonuniversal Critical Dynamics in Monte Carlo Simulations," Physical Review Letters, 58, 86–88.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), The Annals of Statistics, 22, 1701–1762.
Torrie, G. M., and Valleau, J. P. (1977), "Non-Physical Sampling Distributions in Monte Carlo Free Energy Estimation: Umbrella Sampling," Journal of Computational Physics, 23, 187–199.
Wang, F., and Landau, D. P. (2001), "Efficient, Multiple-Range Random-Walk Algorithm to Calculate the Density of States," Physical Review Letters, 86, 2050–2053.
Warnes, G. R. (2001), "The Normal Kernel Coupler: An Adaptive Markov Chain Monte Carlo Method for Efficiently Sampling From Multi-Modal Distributions," Technical Report 395, University of Washington, Dept. of Statistics.
Wong, W. H., and Liang, F. (1997), "Dynamic Weighting in Monte Carlo and Optimization," Proceedings of the National Academy of Sciences USA, 94, 14220–14224.
Yan, Q., and de Pablo, J. J. (2003), "Fast Calculation of the Density of States of a Fluid by Monte Carlo Simulations," Physical Review Letters, 90, 035701.
Younes, L. (1988), "Estimation and Annealing for Gibbsian Fields," Annales de l'Institut Henri Poincaré, 24, 269–294.
——— (1999), "On the Convergence of Markovian Stochastic Algorithms With Rapidly Decreasing Ergodicity Rates," Stochastics and Stochastics Reports, 65, 177–228.