
Hierarchical models and the tuning of RWM algorithms

Mylène Bédard
Département de mathématiques et de statistique, Université de Montréal
Montréal, Canada, H3C 3J7
[email protected]

Abstract

We obtain weak convergence and optimal scaling results for the random walk Metropolis algorithm with Gaussian proposal distribution. The sampler is applied to high-dimensional hierarchical target distributions, which form the building block of any Bayesian analysis. The global asymptotically optimal proposal variances derived may be easily computed as a function of the specific target distribution considered. We also introduce the concept of locally optimal tunings, i.e. tunings that depend on the current position of the Markov chain. The theorems are proved by studying the generators of the first and $i$th components of the algorithm, and verifying their convergence to the generator of a modified RWM algorithm and of a diffusion process, respectively. The rate at which the algorithm explores its state space is optimized by studying the speed measure of the limiting diffusion process. We illustrate the theory with two examples.

AMS 2000 subject classifications: Primary 60F05; secondary 65C40.
Keywords: acceptance probability; diffusion process; efficiency; position-dependent reversibility; RWM-within-Gibbs; weak convergence

1 Introduction

Random walk Metropolis (RWM) algorithms are widely used to sample from complex or high-dimensional probability distributions ([12], [9]). The simplicity and versatility of this sampler often make it the default option in the MCMC toolbox. Implementing a RWM algorithm involves a tuning step, to ensure that the process explores its state space as fast as possible and that the sample produced is representative of the probability distribution of interest (the target distribution). In this paper, we solve the tuning problem for a large class of target distributions with correlated components. This issue has mainly been studied for product target densities, but attention has recently turned towards more complex target models ([6], [10]); in practice, cases with correlated components are often tuned by trial and error and thus need to be further analyzed. The specific type of target distribution considered here is formed of components related according to a hierarchical structure. These distributions are ubiquitous in several fields of research (finance, biostatistics, and physics, to name a few), and constitute the basis of any Bayesian inference.

To implement a RWM algorithm, the user must select a proposal distribution from which candidates for the sample are generated. This distribution should ideally be similar to the target, while remaining accessible from a sampling viewpoint. A pragmatic choice is to let the proposed moves be normally distributed around the latest value of the sample. Tuning the variance of the normal proposal distribution ($\sigma^2$) has a significant impact on the speed at which the sampler explores its state space (hereafter referred to as "efficiency"), with extreme variances leading to slow-mixing algorithms. In particular, large variances seldom generate suitable candidates and result in lazy processes; small variances yield hyperactive processes whose tiny steps lead to a time-consuming exploration of the state space.
Seeking an intermediate value that optimizes the efficiency of the RWM algorithm, i.e. a proposal variance that produces steps as big as possible, accepted as often as possible, is called the optimal scaling problem. The optimal scaling issue of the RWM algorithm with Gaussian proposal has been addressed by many researchers over the last few years. It has been determined in [15] that high-dimensional target densities formed of independent and identically distributed (i.i.d.) components correspond to an optimal proposal variance $\hat{\sigma}^2(n) \approx 5.66/\{n\,\mathrm{E}[(\frac{\partial}{\partial X}\log f(X))^2]\}$, where $f$ is the density of one target component and $n$ the number of target components. This optimal proposal variance has also been shown to correspond to an optimal expected acceptance rate of 23.4%, where the acceptance rate is defined as the proportion of candidate values that are accepted by the algorithm. This result has contributed to a better understanding of the asymptotics of the RWM sampler (as $n \to \infty$). Generalizing this conclusion is an intricate task and further research on the subject has mainly been restricted to the case of high-dimensional target distributions formed of independent components (see [16], [13], [2], [3], [4], [5]). In the specific case of the multivariate normal target distribution, however, the asymptotically optimal variance and acceptance rate may be easily determined (see [13], [1]). Lately, [6] and [10] have also performed scaling analyses of non-product target densities. These advances are important, as MCMC methods are mainly used when dealing with complex models, which only rarely satisfy the independence assumption among target components. These results however assume that the correlation structure among target components is known, and thus used to generate candidates for the chain. This therefore yields,

as expected, an asymptotically optimal acceptance rate of 23.4% (see [16] for an explanation). In this paper, we focus on solving the optimal scaling problem for a wide class of models that include a dependence relationship: the hierarchical distributions. Weak convergence and optimal scaling results are derived without explicitly characterizing the dependency among target components, and thus rely on a Gaussian proposal distribution with diagonal covariance matrix. The specific form of the theoretical results derived seems to support the use of RWM-within-Gibbs over RWM algorithms. In the next section, we describe the target distribution and introduce some notation related to the RWM sampler. The theoretical optimal scaling results are stated in Section 3, and are then illustrated with two examples in Section 4. Extensions are briefly discussed in Section 5, while the appendices contain proofs.

2 Framework

Consider an n-dimensional target distribution consisting of a mixing component X1 and of n−1 conditionally i.i.d. components Xi (i = 2, . . . , n) given X1. Suppose that this distribution has a target density π with respect to Lebesgue measure, where

\[ \pi(x) = f_1(x_1) \prod_{i=2}^{n} f(x_i \mid x_1). \tag{1} \]
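As a small illustration, the hierarchical target (1) translates directly into code. The sketch below assumes user-supplied log-densities; the normal-normal pair shown (our choice, matching the Section 4.1 example) drops normalizing constants, which cancel in RWM acceptance ratios.

```python
import numpy as np

# Hierarchical target (1), up to additive normalizing constants:
# log pi(x) = log f1(x1) + sum_{i>=2} log f(x_i | x1).
def log_pi(x, log_f1, log_f):
    return log_f1(x[0]) + np.sum(log_f(x[1:], x[0]))

# Normal-normal illustration: X1 ~ N(0,1), X_i | X1 ~ N(X1, 1).
log_f1 = lambda x1: -0.5 * x1**2
log_f = lambda x, x1: -0.5 * (x - x1)**2
```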

To characterize the asymptotic behaviour of a component $X_i$ ($i = 2, \ldots, n$), we impose some regularity conditions on the densities $f_1$ and $f$. The density $f_1$ is assumed to be a continuous function on $\mathbb{R}$, with $\mathcal{X}_1 = \{x_1 : f_1(x_1) > 0\}$ forming an open interval. For all fixed $x_1 \in \mathcal{X}_1$, $f(x|x_1)$ is a positive $C^2$ density on $\mathbb{R}$ and $\frac{\partial}{\partial x}\log f(x|x_1)$ is Lipschitz continuous with constant $K(x_1)$ such that $\mathrm{E}[K^2(X_1)] < \infty$. Here, $C^2$ denotes the space of real-valued functions with continuous second derivative. For all fixed $x \in \mathcal{X} = \mathbb{R}$, $f(x|x_1)$ is a $C^2$ function on $\mathcal{X}_1$ and $\frac{\partial}{\partial x}\log f(x|x_1)$ is Lipschitz continuous with constant $L(x)$ such that $\mathrm{E}[L^4(X)] < \infty$. Furthermore,

\[ \mathrm{E}_X\!\left[\left(\tfrac{\partial}{\partial X}\log f(X|x_1)\right)^{4}\right] < \infty \;\; \forall\, x_1 \in \mathcal{X}_1 \quad \text{with} \quad \mathrm{E}\!\left[\left(\tfrac{\partial}{\partial X}\log f(X|X_1)\right)^{4}\right] < \infty\,; \tag{2} \]

hereafter, the notation $\mathrm{E}_X[\cdot]$ means that the expectation is computed with respect to $X$. Where there is no confusion possible, $\mathrm{E}[\cdot]$ shall be used to denote an expectation with respect to all the random variables in the expression. The above regularity conditions constitute an extension of those stated in [3] for target distributions formed of independent components, and are weaker than a Lipschitz continuity assumption on the bivariate function $\frac{\partial}{\partial x}\log f(x|x_1)$ would be. They also imply that the Lipschitz constants $K(x_1)$ and $L(x)$ themselves satisfy a Lipschitz condition.

We now impose further conditions on $f(x|x_1)$ to thoroughly account for the variation of the component $X_1$ when studying the asymptotic behaviour of a component $X_i$ ($i = 2, \ldots, n$). For almost all fixed $x \in \mathcal{X}$, $\frac{\partial}{\partial x_1}\log f(x|x_1)$ is Lipschitz continuous with constant $M(x)$ such that $\mathrm{E}[M^2(X)] < \infty$ and
\[ \mathrm{E}_X\!\left[\left(\tfrac{\partial}{\partial x_1}\log f(X|x_1)\right)^{2}\right] < \infty \;\; \forall\, x_1 \in \mathcal{X}_1 \quad \text{with} \quad \mathrm{E}\!\left[\left(\tfrac{\partial}{\partial X_1}\log f(X|X_1)\right)^{2}\right] < \infty\,. \tag{3} \]

Finally, in order to characterize the asymptotic behaviour of the mixing component $X_1$, let $X_{2:n} = (X_2, \ldots, X_n)$, $\mathbf{X} = (X_2, X_3, \ldots)$, and $\to_p$ denote convergence in probability. Assume that $\mathrm{Var}(X_1|X_{2:n}) \to_p 0$, and denote $\mu \equiv \mu(\mathbf{X})$ such that $\mu_n \equiv \mu_n(X_{2:n}) = \mathrm{E}_{X_1}[X_1|X_{2:n}] \to_p \mu$ as $n \to \infty$, with $|\mu| < \infty$. Hereafter, we make a small abuse of notation by letting $\mu$ and $\mu_n$ denote either the random variable or its realization, depending on the context. Furthermore, define $\tilde{X}_1 = \sqrt{n}(X_1 - \mu_n)$; for almost all $x_{2:n} \in \mathbb{R}^{n-1}$, the conditional density of $\tilde{X}_1$ given $x_{2:n}$, $\frac{1}{\sqrt{n}} f_1(\mu_n + \frac{\tilde{x}_1}{\sqrt{n}} \,|\, x_{2:n})$, is assumed to converge almost surely to $g_1(\tilde{x}_1|\mathbf{x})$, a continuous density on $\mathbb{R}$ with respect to Lebesgue measure. These assumptions simply mean that the limiting density of $X_1|x_{2:n}$ is degenerate, but that an appropriate rescaling leads to a non-trivial density on $\mathbb{R}$.

To obtain a sample from the target density in (1), we rely on the RWM algorithm with Gaussian proposal distribution. The idea behind this method is to build an $n$-dimensional Markov chain $\{X^{(n)}[j];\, j \in \mathbb{N}\}$ having the target distribution as its stationary distribution. One iteration of this algorithm is performed according to the following steps. Given $X^{(n)}[j] = x$, the time-$j$ state of the Markov chain, a value $Y^{(n)}[j+1] = y$ is generated from a $N(x, \sigma^2(n) I_n)$, where $\sigma^2(n) > 0$ and $I_n$ is the $n$-dimensional identity matrix. The candidate $Y^{(n)}[j+1] = y$ is accepted as a suitable value for the Markov chain with probability $\alpha(x, y) = 1 \wedge \pi(y)/\pi(x)$, in which case we set $X^{(n)}[j+1] = y$; otherwise, the Markov chain remains at the same state for another time interval and $X^{(n)}[j+1] = x$. Although generally not emphasized in the literature, we note that the proposal variance may be a function of $x$, resulting in a non-homogeneous random walk sampler.
In that case, the ratio of proposal densities should be explicitly considered in the acceptance probability: $\alpha(x, y) = 1 \wedge \pi(y) q_n(x; y)/\{\pi(x) q_n(y; x)\}$, where $q_n(y; x)$ is the density of a $N(x, \sigma^2(n, x) I_n)$. The first thought of most MCMC users facing a target density as in (1) would be to use a RWM-within-Gibbs algorithm, which consecutively updates subgroups of the $n$ components within a given iteration. Unfortunately, optimal scaling results for simultaneously updating a finite number of components are available for specific target distributions only (Gaussian and other elliptically symmetric unimodal targets; see [18]). Tuning of high-dimensional RWM-within-Gibbs algorithms has been addressed in [13], but only for target distributions with i.i.d. components and Gaussian targets with correlation. We point out that RWM and RWM-within-Gibbs samplers both have the potential of being geometrically ergodic, but neither can yield a uniformly ergodic Markov chain (see [11], [17], [14]). Focusing on RWM algorithms is thus a good starting point for understanding the behaviour of various MCMC methods applied to high-dimensional target distributions with a hierarchical structure. In the sequel we derive weak convergence and optimal scaling results for the RWM algorithm with Gaussian proposal by letting $n$, the dimension of the target density in (1), approach $\infty$. Traditionally, asymptotically optimal scaling results have been obtained by studying the limiting path of a given component ($X_2$, say) as $n \to \infty$. In the case of target distributions with i.i.d. components (and some extensions), the components of the RWM algorithm are asymptotically independent of each other and their limiting behaviour is governed by identical one-dimensional Markovian processes. In the current correlated framework, we expect an asymptotic dependence relationship among $X_i$ ($i \in \{2, \ldots, n\}$) and $X_1$, in the spirit of (1).
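The iteration just described, including the optional position-dependent proposal variance with its proposal-density correction, can be sketched as follows (a minimal illustration; function and argument names are ours, not the paper's):

```python
import numpy as np

def rwm_step(x, log_target, sigma2, rng):
    """One RWM iteration. sigma2 may be a constant or a function of the current
    state; in the latter (non-homogeneous) case, the proposal-density ratio
    q_n(x;y)/q_n(y;x) enters the acceptance probability."""
    s2_x = sigma2(x) if callable(sigma2) else sigma2
    y = x + np.sqrt(s2_x) * rng.standard_normal(x.shape)
    log_alpha = log_target(y) - log_target(x)
    if callable(sigma2):
        s2_y = sigma2(y)
        n = x.size
        sq = float(np.sum((y - x) ** 2))
        # log q_n(x; y) - log q_n(y; x), with q_n(y; x) the N(x, s2_x I_n) density
        log_alpha += (-0.5 * sq / s2_y - 0.5 * n * np.log(s2_y)) \
                     - (-0.5 * sq / s2_x - 0.5 * n * np.log(s2_x))
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False
```

With a constant `sigma2` the correction term vanishes and the usual symmetric rule $1 \wedge \pi(y)/\pi(x)$ is recovered.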
In order to obtain non-trivial limiting processes describing the behaviour of the components

$X_2, \ldots, X_n$ given the current state of $X_1$ in the RWM sampler as $n \to \infty$, space- and time-rescaling factors need to be applied to the algorithm. These factors are chosen according to the rate at which the component under study explores its state space (given the other components). We first note that for a fixed proposal variance $\sigma^2(n) = \sigma^2$, the odds of rejecting an $n$-dimensional candidate value increase with $n$ and lead to an acceptance probability that converges to 0. To overcome this problem, we let the proposal variance be a decreasing function of the dimension by setting $\sigma^2(n) = \ell^2/n$. Since $\sigma^2(n) \to 0$ as $n \to \infty$, the chosen space-rescaling factor by itself leads to an algorithm whose limiting behaviour is constant. It then becomes necessary to also apply a time-rescaling factor to the RWM algorithm, and so the time interval between each proposed candidate is set to $1/n$. In fact, it shall be convenient to state results in terms of the continuous-time, sped-up version of the initial Markov chain defined as $\{\mathbf{W}^{(n)}(t);\, t \geq 0\} = \{X^{(n)}[\lfloor nt \rfloor];\, t \geq 0\}$, where $\lfloor \cdot \rfloor$ is the floor function.

For target distributions with i.i.d. components, all components of the $n$-dimensional algorithm clearly mix at the same rate (see [15]). As soon as the i.i.d. assumption is relaxed, however, this statement does not necessarily hold and the mixing rate of one component might significantly differ from that of another. This might even happen under an assumption of independence among the target components; in an $n$-dimensional target distribution that is normally distributed around 0 with diagonal covariance matrix $\mathrm{diag}(1/n, 1, \ldots, 1)$ for instance, $X_1$ explores its state space significantly faster than $X_{2:n}$ as $n \to \infty$. This assumes, as mentioned previously, that a proposal covariance of the form $\ell^2 I_n/n$ is used; we refer the reader to [3] for a detailed treatment of this case.
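The effect motivating the scaling $\sigma^2(n) = \ell^2/n$ can be checked by Monte Carlo. The sketch below estimates the stationary expected acceptance rate of RWM on a hypothetical i.i.d. $N(0,1)^n$ target, once with a fixed proposal variance and once with a variance decreasing as $\ell^2/n$ (using $\ell^2 = 5.66$ from the i.i.d. theory):

```python
import numpy as np

def accept_rate(n, sigma2, trials=20_000, seed=0):
    """Monte Carlo estimate of the stationary expected acceptance rate of a RWM
    algorithm on an i.i.d. N(0,1)^n target with proposal N(x, sigma2 * I_n)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))                     # X ~ pi
    y = x + np.sqrt(sigma2) * rng.standard_normal((trials, n))
    log_ratio = 0.5 * np.sum(x**2 - y**2, axis=1)            # log pi(y)/pi(x)
    return float(np.mean(np.minimum(1.0, np.exp(log_ratio))))
```

For a fixed variance the estimate collapses towards 0 as $n$ grows, while the $\ell^2/n$ scaling keeps it near the familiar 0.234.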
An implication of these differing mixing rates is the need to study the one-dimensional paths of the multidimensional Markov chain separately, as they potentially require different rescaling factors. For hierarchical target distributions with density (1), it is intuitively clear that, conditionally on the current state of $X_2, \ldots, X_n$, the component $X_1$ requires fewer iterations than the other components to explore its state space. Indeed, the information about $X_1$ contained in $X_{2:n}$ grows with $n$, resulting in a density for $X_1|x_{2:n}$ that is increasingly concentrated, hence the rapid mixing of $X_1$ (over its conditional state space). In the current framework, where $f_1$ in (1) is independent of $n$ by assumption, the component $X_1$ mixes in $O(1)$ iterations given $X_{2:n}$ (compared to $O(n)$ for any other component given $X_1$). These conditional mixing rates are useful for tuning the algorithm, but do not provide information about the overall mixing rate of the sampler; the latter must be found on a case-by-case basis.

When the component of interest is $X_1$, the process $\{\mathbf{W}^{(n)}(t);\, t \geq 0\}$ thus moves too rapidly, on too small a scale, to see what really happens. To account for these different mixing rates and obtain a non-trivial conditional limiting process for $X_1$ given $X_{2:n}$, we apply a transformation. Specifically, we let $\tilde{X}_1 = \sqrt{n}(X_1 - \mu_n)$ and study the process $\{\tilde{\mathbf{W}}^{(n)}(t);\, t \geq 0\} = \{(\tilde{X}_1^{(n)}[\lfloor t \rfloor], X_{2:n}^{(n)}[\lfloor t \rfloor]);\, t \geq 0\}$, a continuous-time version of the transformed Markov chain. In other words, we are now looking at a magnified, centered version of the path associated with the first component; this implies that $\tilde{Y}_1 = \sqrt{n}(Y_1 - \mu_n) \sim N(\tilde{x}_1, \ell^2)$ and $Y_i \sim N(x_i, \ell^2/n)$, $i = 2, \ldots, n$.

3 Asymptotics of the RWM algorithm

In this section we introduce results about the limiting behaviour (as $n \to \infty$) of the time- and scale-adjusted univariate processes $\{\tilde{W}_1^{(n)}(t);\, t \geq 0\}$ and $\{W_i^{(n)}(t);\, t \geq 0\}$ ($i = 2, \ldots, n$). From these results we determine the asymptotically optimal scaling (AOS) values and the asymptotically optimal acceptance rate (AOAR) that optimize the mixing of the algorithm. Hereafter, we let $\Rightarrow$ denote weak convergence in the Skorokhod topology and $B(t)$ a Brownian motion at time $t$; the cumulative distribution function of a standard normal random variable is denoted by $\Phi(\cdot)$.

Theorem 1. Consider a RWM algorithm with proposal distribution $N(x, \ell^2 I_n/n)$ used to sample from a target density $\pi$ as in (1). Suppose that $\pi$ satisfies the conditions on $f_1$ and $f$ specified in Section 2, and that $X^{(n)}(0)$ is distributed according to $\pi$ in (1).

If $\frac{1}{n}\sum_{i=2}^{n} \left(\frac{\partial}{\partial X_i} \log f(X_i \,|\, X_1 = \mu_n + \frac{\tilde{X}_1}{\sqrt{n}})\right)^2 \to_p \tilde{\gamma}(\mu)$ with
\[ \tilde{\gamma}(\mu) = \mathrm{E}_X\!\left[\left(\tfrac{\partial}{\partial X}\log f(X|\mu(\mathbf{X}))\right)^2\right] = \int_{\mathbb{R}} \left(\tfrac{\partial}{\partial x}\log f(x|\mu(\mathbf{X}))\right)^2 f(x|\mu(\mathbf{X}))\, dx < \infty\,, \]
then the magnified process $\{\tilde{W}_1^{(n)}(t);\, t \geq 0\} \Rightarrow \{\tilde{W}_1(t);\, t \geq 0\}$. Here, $W_1(0)$ and $W_i(0)$ ($i = 2, 3, \ldots$) are distributed according to the densities $f_1$ and $f$ respectively, which implies that $\tilde{W}_1(0)$ is distributed according to the density $g_1$ in Section 2. Given the time-$t$ state $\tilde{\mathbf{W}}(t) = (\tilde{x}_1, \mathbf{x})$, the process $\{\tilde{W}_1(t);\, t > 0\}$ evolves as the continuous-time version of a special RWM algorithm applied to the target density $g_1(\tilde{x}_1|\mathbf{x})$; the proposal distribution of this algorithm is a $N(\tilde{x}_1, \ell^2)$ and the acceptance rule is defined as

\[ \alpha^*(\tilde{x}_1, \tilde{y}_1|\mathbf{x}) = \Phi\!\left(\frac{\log\frac{g_1(\tilde{y}_1|\mathbf{x})}{g_1(\tilde{x}_1|\mathbf{x})} - \frac{\ell^2}{2}\tilde{\gamma}(\mu)}{\ell\,\tilde{\gamma}^{1/2}(\mu)}\right) + \frac{g_1(\tilde{y}_1|\mathbf{x})}{g_1(\tilde{x}_1|\mathbf{x})}\,\Phi\!\left(\frac{-\log\frac{g_1(\tilde{y}_1|\mathbf{x})}{g_1(\tilde{x}_1|\mathbf{x})} - \frac{\ell^2}{2}\tilde{\gamma}(\mu)}{\ell\,\tilde{\gamma}^{1/2}(\mu)}\right). \tag{4} \]

Proof. See Appendix A.1.

This result describes the limiting path associated with the component $\tilde{X}_1$ as $n \to \infty$, which is Markovian with respect to $\mathcal{F}^{\tilde{\mathbf{W}}}(t)$. This component mixes in $O(1)$ iterations and thus explores its conditional state space much more efficiently than the other components, as shall be witnessed in Theorem 2. The asymptotic process found can be described as an atypical one-dimensional RWM algorithm, whose acceptance rule $\alpha^*(\tilde{x}_1, \tilde{y}_1|\mathbf{x})$ and target density $g_1(\tilde{x}_1|\mathbf{x})$ both vary according to $\mathbf{x}$ at every iteration. We point out that the acceptance function $\alpha^*$ in (4) satisfies the reversibility condition with respect to $g_1(\tilde{x}_1|\mathbf{x})$ (see [3] for more details about this acceptance function). Theorem 1 is interesting from a theoretical perspective, but cannot be used to optimize the overall efficiency of the algorithm. Although we could try to determine the value of $\ell$ leading to the optimal mixing of the component $X_1$, it is wiser to focus instead on the mixing rate of the components $X_2, X_3, \ldots$, as shall soon become clear. We remind the reader that because of these different (conditional) mixing rates, it is not possible to study the limiting behaviour of the components $X_1$ and $X_2$ (say) simultaneously. Indeed, the algorithm must be rescaled by a different factor to identify the asymptotic process associated with $X_1$.
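The acceptance rule (4) can be implemented directly. The sketch below assumes, as a hypothetical illustration, a standard normal $g_1$ (the limiting density of the Section 4.1 example) and checks numerically that $\alpha^*$ satisfies the reversibility condition $g_1(\tilde{x}_1)\,\alpha^*(\tilde{x}_1, \tilde{y}_1) = g_1(\tilde{y}_1)\,\alpha^*(\tilde{y}_1, \tilde{x}_1)$:

```python
from math import erf, exp, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))

def alpha_star(x1t, y1t, log_g1, ell, gamma_t):
    """Acceptance rule (4); gamma_t plays the role of tilde-gamma(mu)."""
    r = log_g1(y1t) - log_g1(x1t)          # log of the g1 density ratio
    b = 0.5 * ell**2 * gamma_t
    s = ell * sqrt(gamma_t)
    return Phi((r - b) / s) + exp(r) * Phi((-r - b) / s)

# Reversibility check w.r.t. g1, taken standard normal here (constants cancel):
log_g1 = lambda u: -0.5 * u**2
lhs = exp(log_g1(0.3)) * alpha_star(0.3, -1.2, log_g1, 2.0, 1.0)
rhs = exp(log_g1(-1.2)) * alpha_star(-1.2, 0.3, log_g1, 2.0, 1.0)
```

The identity holds exactly, as a short calculation with the two $\Phi$ terms confirms.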

Theorem 2. Consider a RWM algorithm with proposal distribution $N(x, \ell^2 I_n/n)$ used to sample from a target density $\pi$ as in (1). Suppose that $\pi$ satisfies the conditions on $f_1$ and $f$ specified in Section 2, and that $X^{(n)}(0)$ is distributed according to $\pi$ in (1). For $i = 2, \ldots, n$, we have $\{W_i^{(n)}(t);\, t \geq 0\} \Rightarrow \{W_i(t);\, t \geq 0\}$, where $W_1(0)$ and $W_i(0)$ are distributed according to the densities $f_1$ and $f$ respectively. Conditionally on $W_1(t)$, the evolution of $\{W_i(t);\, t > 0\}$ over an infinitesimal interval $dt$ satisfies

\[ dW_i(t) = \upsilon^{1/2}(\ell, W_1(t))\, dB(t) + \tfrac{1}{2}\, \upsilon(\ell, W_1(t))\, \frac{\partial}{\partial W_i(t)} \log f\big(W_i(t)|W_1(t)\big)\, dt\,, \tag{5} \]
with

\[ \upsilon(\ell, x_1) = 2\ell^2\, \mathrm{E}_{Z_1}\!\left[\Phi\!\left(-\tfrac{\ell}{2}\, \gamma^{1/2}(x_1, Z_1)\right)\right], \tag{6} \]
$Z_1 = \sqrt{n}(Y_1 - x_1)/\ell \sim N(0, 1)$, and

\[ \gamma(x_1, z_1) = z_1^2\, \mathrm{E}_X\!\left[\left(\tfrac{\partial}{\partial x_1}\log f(X|x_1)\right)^2\right] + \mathrm{E}_X\!\left[\left(\tfrac{\partial}{\partial X}\log f(X|x_1)\right)^2\right]. \tag{7} \]

Proof. See Appendix A.2.

Equation (5) describes the random behaviour of the process at the next instant ($t + dt$), given its position at time $t$. This expression should not come as a surprise: each rescaled component $X_i$ ($i = 2, \ldots, n$) asymptotically behaves according to a diffusion process that is Markovian with respect to $\mathcal{F}^{(W_1, W_i)}(t)$. Examination of (5) also tells us that $f(W_i(t)|W_1(t))$ is invariant for this diffusion process (see [19], for instance).
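As a sanity check on (5), the diffusion can be discretized with the Euler-Maruyama scheme. The sketch below assumes, hypothetically, a constant speed measure $\upsilon$ and $f(\cdot|w_1) = N(w_1, 1)$, so that $\frac{\partial}{\partial w}\log f(w|w_1) = -(w - w_1)$ and (5) reduces to an Ornstein-Uhlenbeck process whose stationary law is $N(w_1, 1)$, checked here through long-run moments:

```python
import numpy as np

# Euler-Maruyama discretization of (5) under the stated assumptions.
rng = np.random.default_rng(1)
w1, v, dt, steps = 0.5, 0.8, 0.01, 1_000_000
noise = np.sqrt(v * dt) * rng.standard_normal(steps)   # diffusion increments
path = np.empty(steps)
w = 0.0
for t in range(steps):
    w += -0.5 * v * (w - w1) * dt + noise[t]           # drift (1/2) v dlogf/dw
    path[t] = w
burned = path[100_000:]                                # discard the transient
```

Regardless of the value of the constant $\upsilon > 0$, the invariant law stays $N(w_1, 1)$; $\upsilon$ only sets how fast the process traverses it, which is exactly why it is called a speed measure.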

Since $X_1$ and $X_2$ must be rescaled by different factors in order to be studied, the asymptotic behaviour of these components cannot be expressed as a bivariate diffusion process, as one might have expected. To obtain such a diffusion, we would have to rely on inhomogeneous proposal variances ensuring that $X_1$ also mixes in $O(n)$ iterations; in other words, we would have to use a proposal covariance matrix of the form $\sigma_1^2(n) = \mathrm{diag}(\ell^2/n^2, \ell^2/n, \ldots, \ell^2/n)$. This framework would of course be suboptimal, as it restricts the movements of $X_1$. Proposed jumps for $X_1$ would then become insignificant, and so the first term in (7) would be null.

Remark 3. Studying the limiting behaviour of $X_1$ and $X_i$ ($i = 2, \ldots, n$) separately does not result in a loss of information. In fact, studying the paths of $X_1, X_2$ simultaneously would require letting the test function $h$ of the generator in (A.3) be a function of $X_1, X_2$. Such a generator would however develop into an expression in which the cross-derivative terms (e.g. $\frac{\partial^2}{\partial x_1 \partial x_2} h(x_1, x_2)$) are null. Therefore, given the current state of the asymptotic process, the next one-dimensional moves are performed independently for each component.

The limiting processes in Theorems 1 and 2 indicate that the component $X_1$ explores its conditional state space at a different (higher) rate than $X_{2:n}$. Added to the specific Markovian form of the limiting processes obtained (with respect to $\mathcal{F}^{\tilde{\mathbf{W}}}(t)$ and $\mathcal{F}^{(W_1, W_i)}(t)$ respectively), this points towards the need to update $X_1$ and $X_{2:n}$ separately, attesting to the superiority of RWM-within-Gibbs samplers for sampling from hierarchical targets. These algorithms update blocks of components successively, a design that allows fully exploiting the characteristics of

the target considered. To our knowledge, this is the first time that asymptotic results are used to theoretically validate the superiority of RWM-within-Gibbs over RWM samplers for hierarchical target distributions. This theoretical superiority is obviously tempered in practice by an increased computational effort; the extent of this computational overhead is, however, difficult to quantify in full generality.

3.1 Optimal tuning of the RWM algorithm

Given that there are two different types of limiting processes and that the algorithm either accepts or rejects all $n$ individual candidates in a given step, we now face the dilemma of which process should be chosen to optimize the exploration of the state space. To be confident that the $n$-dimensional chain has entirely explored its state space, we must be certain that every one-dimensional path has explored its state space completely. Since $n - 1$ of the $n$ components conditionally mix in $O(n)$ iterations while only one component conditionally mixes in $O(1)$ iterations, optimal mixing of the sampler shall be attained by optimizing the mixing of the components $X_i$, $i = 2, \ldots, n$. For $i = 2, \ldots, n$, the only quantity depending on the proposal variance (i.e. on $\ell$) in the limit is $\upsilon(\ell, W_1(t))$. It thus makes sense to optimize the mixing rate of any of these one-dimensional processes by choosing the diffusion process that travels fastest, i.e. the value of $\ell$ for which the speed measure $\upsilon(\ell, W_1(t))$ is maximized.

The speed measure in (6) is quite intuitive; it is in fact similar to that obtained when studying i.i.d. target densities. The main difference lies in the form of $\gamma(x, z)$ which, in the i.i.d. case, is given by the constant term $\gamma = \mathrm{E}[(\frac{\partial}{\partial X}\log f(X))^2]$. The second term in (7) is thus the analogue of $\gamma$, and consists of a measure of roughness of the conditional density $f(x_2|x_1)$ under a variation of $x_2$ (say). In the case of hierarchical target distributions, we find an extra term that might be viewed as a measure of roughness of $f(x_2|x_1)$ under a variation of $x_1$. This term is weighted by $z_1^2$, the square of the standardized candidate increment for the first component; in other words, the further the candidate $y_1$ is from the current value $x_1$, the greater the impact of the associated measure of roughness.
Of course, in optimizing the speed measure function, we do not need to know in advance the exact value of the standardized candidate increment $z_1$; the speed measure averages over this quantity. It is interesting to note that optimizing the speed measure leads to a local proposal variance of the form $\hat{\ell}^2(W_1(t))/n$. Such a proposal variance would then be used for proposing a candidate at the next instant $t + dt$, given the position of the mixing component at time $t$. This local proposal variance thus varies from one iteration to another, as opposed to the usual tunings in the literature, which remain fixed for the duration of the algorithm. Naturally, if both expectations in (7) are constant with respect to $x_1$, the proposal variance obtained by maximizing the speed measure is constant.
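To make the local-tuning idea concrete, the speed measure (6) can be maximized over a grid of $\ell$ values for each fixed $x_1$. The sketch below assumes a hypothetical conditional density $f(x|x_1) = N(0, e^{x_1})$ (a variance relationship, not an example from this paper), for which a direct calculation from (7) gives $\gamma(x_1, z) = z^2/2 + e^{-x_1}$; the expectation over $Z \sim N(0, 1)$ is approximated by quadrature.

```python
import numpy as np
from math import erf

Phi = lambda t: 0.5 * (1.0 + erf(t / 2**0.5))
z = np.linspace(-6.0, 6.0, 601)           # quadrature grid for Z ~ N(0,1)
wts = np.exp(-0.5 * z**2)
wts /= wts.sum()                          # normalized Gaussian weights

def speed(ell, x1):
    """v(ell, x1) = 2 ell^2 E_Z[Phi(-(ell/2) gamma(x1, Z)^{1/2})], eq. (6)."""
    g_sqrt = np.sqrt(0.5 * z**2 + np.exp(-x1))    # gamma(x1, z)^{1/2}
    return 2 * ell**2 * sum(w * Phi(-0.5 * ell * g) for w, g in zip(wts, g_sqrt))

def local_aos(x1, ells=np.arange(0.05, 6.0, 0.02)):
    """Grid-search the locally optimal scaling value at fixed x1."""
    return max(ells, key=lambda ell: speed(ell, x1))
```

In this hypothetical model the optimal scaling grows with $x_1$ (the conditional density gets smoother as its variance increases), which is precisely the position-dependence discussed above.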

Remark 4. The asymptotic process introduced in Theorem 2 naturally leads us to the concept of local proposal variances. It is however unclear whether the local tunings obtained by maxi- mizing (6) really optimize the mixing rate of the algorithm. Indeed, the proof of Theorem 2 is carried out with `2 constant; this allows, among other things, relying on the simplified form for the acceptance probability. In order to claim that the local proposal variances obtained are optimal, a weak convergence result would need to be proved using a general proposal variance

of the form $\sigma^2(n, x_1) = \ell^2(x_1)/n$. This extension is not trivial, as the ratio of proposal densities $q_n(x; y)/q_n(y; x)$ would then need to be included in the acceptance probability. Since the concept of locally optimal proposal variances is numerically demanding in the current framework, we choose to focus on $\ell^2$ constant.

Remark 5. It turns out that the locally optimal scaling values obtained by optimizing (6) are bounded above by $2.38/\mathrm{E}_X^{1/2}[(\frac{\partial}{\partial X}\log f(X|x_1))^2]$, the asymptotically optimal scaling (AOS) value under the i.i.d. assumption given that $X_1 = x_1$. To realize this, assume that $X_1$ is fixed across iterations; we then find ourselves in an i.i.d. setting and the associated speed measure is $2\ell^2\,\Phi\big(-\frac{\ell}{2}\,\mathrm{E}_X^{1/2}[(\frac{\partial}{\partial X}\log f(X|x_1))^2]\big)$. The upper bound then follows from the fact that the function $\Phi(\cdot)$ in (6) decreases faster in $\ell$ than the function $\Phi(\cdot)$ in the above expression.

Relying on $\hat{\ell}(x_1)$ to propose a candidate for the next time interval is usually time-consuming, as it involves numerically solving for the appropriate local proposal variance at every iteration. Since the process is assumed to start in stationarity and $X_1$ mixes faster than the other components, we might instead determine a value $\hat{\ell}$ that is fixed for the duration of the algorithm by integrating the speed measure $\upsilon(\ell, \cdot)$ over all possible values of $X_1$ (with respect to the distribution $f_1(\cdot)$ of $X_1$). The global AOS value $\hat{\ell}$ maximizes the function
\[ \mathrm{E}_{X_1}[\upsilon(\ell, X_1)] = 2\ell^2\, \mathrm{E}_{X_1, Z_1}\!\left[\Phi\!\left(-\tfrac{\ell}{2}\,\gamma^{1/2}(X_1, Z_1)\right)\right] = 2\ell^2 \int_{\mathcal{X}_1}\!\int_{\mathbb{R}} \Phi\!\left(-\tfrac{\ell}{2}\,\gamma^{1/2}(x_1, z_1)\right) \phi(z_1)\, f_1(x_1)\, dz_1\, dx_1\,, \]
where $\phi(\cdot)$ is the probability density function of a standard normal random variable; the solution to this maximization provides the best AOS value unconditionally. The global AOS value might be reexpressed in terms of an asymptotically optimal acceptance rate; this means that rather than selecting a specific scaling value, one might instead monitor the acceptance rate in order to work with an optimally mixing version of the RWM algorithm. To work in terms of acceptance rates, we introduce the expected acceptance rate of the $n$-dimensional stationary RWM algorithm with a normal proposal:
\[ a_n(\ell) = \int\!\!\int \alpha(x, y)\, \left(\tfrac{\ell}{\sqrt{n}}\right)^{-n} \phi_n\!\left(\tfrac{y - x}{\ell/\sqrt{n}}\right) \pi(x)\, dy\, dx\,, \]

where $\phi_n(\cdot)$ stands for the probability density function of an $n$-dimensional standard normal random variable. Results about optimal mixing of the RWM algorithm are summarized in the following corollary.

Corollary 6. In the setting of Theorem 2, the global asymptotically optimal scaling value $\hat{\ell}$ maximizes
\[ 2\ell^2 \int_{\mathcal{X}_1}\!\int_{\mathbb{R}} \Phi\!\left(-\tfrac{\ell}{2}\,\gamma^{1/2}(x, z)\right) \phi(z)\, f_1(x)\, dz\, dx\,. \]
Furthermore, we have that
\[ \lim_{n \to \infty} a_n(\ell) = a(\ell) \equiv 2 \int_{\mathcal{X}_1}\!\int_{\mathbb{R}} \Phi\!\left(-\tfrac{\ell}{2}\,\gamma^{1/2}(x, z)\right) \phi(z)\, f_1(x)\, dz\, dx \]
and the corresponding asymptotically optimal acceptance rate is given by $a(\hat{\ell})$.
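Corollary 6 can be evaluated numerically. The sketch below assumes, purely for illustration, the hypothetical conditional density $f(x|x_1) = N(0, e^{x_1})$ (so that $\gamma(x_1, z) = z^2/2 + e^{-x_1}$ by a direct calculation from (7)) together with $f_1 = N(0, 1)$, and locates the global AOS value $\hat{\ell}$ and the limiting acceptance rate $a(\hat{\ell})$ by two-dimensional quadrature:

```python
import numpy as np
from math import erf

Phi = lambda t: 0.5 * (1.0 + erf(t / 2**0.5))
pts = np.linspace(-6.0, 6.0, 121)
Z, X1 = np.meshgrid(pts, pts)
W = np.exp(-0.5 * (Z**2 + X1**2))
w = (W / W.sum()).ravel()                      # phi(z) f1(x1) quadrature weights
g_sqrt = np.sqrt(0.5 * Z**2 + np.exp(-X1)).ravel()   # gamma(x1, z)^{1/2}

def a(ell):
    """Limiting expected acceptance rate a(ell) of Corollary 6."""
    return 2 * sum(wi * Phi(-0.5 * ell * g) for wi, g in zip(w, g_sqrt))

ells = np.arange(0.1, 5.0, 0.05)
v = np.array([ell**2 * a(ell) for ell in ells])   # E_{X1}[v(ell, X1)] = ell^2 a(ell)
ell_hat = ells[np.argmax(v)]
aoar = a(ell_hat)
```

Consistent with the discussion above, the resulting AOAR falls below the i.i.d. value of 0.234 whenever $\gamma$ genuinely varies with $(x_1, z)$.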

In contrast to the i.i.d. case, the AOAR found is not independent of the densities $f_1$ and $f$. There is therefore no great advantage in choosing to tune the acceptance rate of the algorithm rather than the proposal variance; in fact, both approaches involve the same effort. Although it would also be possible to compute an overall acceptance rate associated with using local proposal variances, it could not be used to tune the algorithm. Building an optimal Markov chain based on local proposal variances would imply modifying the proposal variance at every iteration, which cannot be achieved by solely monitoring the acceptance rate. For simplicity, the theoretical results in this section attribute identical proposal variances to all $n$ components. In practice, when a RWM algorithm is used to sample from a hierarchical target, users will likely want to use a different proposal variance for the mixing component. In fact, nothing in the proofs of Theorems 1 and 2 prevents us from relaxing the initial framework to include inhomogeneous proposal variances.

Corollary 7. Let $Y_1 \sim N(x_1, \ell^2 \kappa_1^2/n)$ with $0 < \kappa_1 < \infty$ and $Y_{2:n} \sim N(x_{2:n}, \ell^2 I_{n-1}/n)$, where $Y_1$ and $Y_{2:n}$ are of course independent. Then, Theorems 1 and 2 hold as stated, except that the limiting proposal distribution in Theorem 1 is $\tilde{Y}_1 \sim N(\tilde{x}_1, \ell^2 \kappa_1^2)$ and the random variable $Z_1$ in Theorem 2 is such that $Z_1 \sim N(0, \kappa_1^2)$.

In this paper, we consider a simple yet useful hierarchical model, which is formed of one mixing component X1, and where the remaining components are conditionally i.i.d. given X1. This is a natural starting point to study weak convergence of RWM algorithms for hi- erarchical target distributions, and even for correlated target distributions in general (used in conjunction with proposal distributions featuring independent components). There exist many generalizations of (1), just as there exist many generalizations of the proposal distri- bution considered. Although some extensions of the hierarchical model in (1) are considered in the discussion, we do not aim at presenting a detailed treatment of these cases.

4 Numerical studies

To illustrate the results of Section 3, we consider two toy examples: the first target distribu- tion considered is a normal-normal hierarchical model in which the components X2,...,Xn are related through their mean, while the second one is a gamma-normal hierarchical model in which X2,...,Xn are related through their variance.

4.1 Normal-normal hierarchical distribution

Consider an $n$-dimensional hierarchical target distribution such that $X_1 \sim N(0, 1)$ and $X_i|X_1 \sim N(X_1, 1)$ for $i = 2, \ldots, n$. To sample from this target model, we use a RWM algorithm with a $N(x, \ell^2 I_n/n)$ proposal distribution. As shall be seen below, this simple target allows us to relate Theorem 2 to the theoretical results derived in [3]. Standard calculations lead to $X_1|X_{2:n} \sim N(\sum_{i=2}^n X_i/n,\, 1/n)$; as $n \to \infty$, $\mathrm{Var}(X_1|X_{2:n}) \to 0$ almost surely. If we let $\mu_n = \sum_{i=2}^n X_i/n$ and $\tilde{X}_1 = n^{1/2}(X_1 - \mu_n)$, then $\tilde{X}_1|X_{2:n} \sim N(0, 1)$. Furthermore, $\frac{1}{n}\sum_{i=2}^n (X_i - \mu_n - \frac{\tilde{X}_1}{\sqrt{n}})^2 = \frac{1}{n}\sum_{i=2}^n (X_i - X_1)^2 = \frac{1}{n}\sum_{i=2}^n Z_i^2$ converges in probability to $\mathrm{E}[Z^2] = \int (\frac{\partial}{\partial x}\log f(x|\mu))^2 f(x|\mu)\, dx = 1$, where $Z_1, \ldots, Z_n$ denote independent

standard normal random variables. By Theorem 1, we can thus affirm that the component $\tilde{X}_1$ asymptotically behaves according to a one-dimensional RWM algorithm with a standard normal target and acceptance function as in (4); these do not, in the current case, depend on $\mathbf{x}$. Evaluating the function $\gamma(x_1, z_1)$ in (7) is a simple task and leads to $\gamma(x_1, z_1) = z_1^2 + 1$. The AOS value might then be found by maximizing

\[ \upsilon(\ell) = 2\ell^2\, \mathrm{E}_{Z_1}\!\left[\Phi\!\left(-\tfrac{\ell}{2}\sqrt{Z_1^2 + 1}\right)\right] \]

with respect to $\ell$, where $Z_1 \sim N(0, 1)$. This yields an AOS value $\hat{\ell}^2 = 4.00$ and a corresponding AOAR of $\upsilon(\hat{\ell})/\hat{\ell}^2 = 0.205$. These values are of course smaller than those obtained for a target with i.i.d. components (5.66 and 0.234 respectively); indeed, the proposal distribution is formed of i.i.d. components and is thus better suited to similar targets. This proposal distribution is nonetheless widely used in MCMC applications; relying on a proposal with correlated components would require some understanding of the correlation structure among the target components and is not really suitable in our general framework.

It is worth pointing out that the speed measure of the limiting diffusion process does not depend on $X_1$ in the present case. This holds for arbitrary densities $f_1$ and $f$ satisfying the conditions in Section 2, provided that $X_1$ is a location parameter for the $X_i$s. Since a variation in the location parameter does not affect the roughness of the corresponding distribution, the AOS and AOAR found are valid both locally and globally. This means that $\hat{\ell}$, which remains fixed across iterations, is the best possible proposal scaling conditionally on the last position of the component $X_1$ (i.e. $\hat{\ell} = \hat{\ell}(x_1)$).

A second peculiarity of this example is that the target distribution is jointly normal with mean 0 and $n \times n$ covariance matrix $\Sigma_n$ given by $\sigma_1^2 = 1$, $\sigma_j^2 = 2$ ($j = 2, \ldots, n$), and $\sigma_{i,j} = 1$ $\forall i \neq j$ ($i, j = 1, \ldots, n$). Normal distributions being invariant under orthogonal transformations, we can find a transformation under which the target components become mutually independent. The covariance matrix $\Sigma_n$ is thus transformed into a diagonal matrix whose diagonal elements are the eigenvalues of $\Sigma_n$. In large dimensions, the eigenvalues may be approximated by $1/(n+1), (n+1), 1, \ldots, 1$. It turns out that the optimal scaling problem for target distributions of this sort (i.e. formed of components that are i.i.d. up to a scaling term) has been studied in [1]. Solving for the AOS value and AOAR of the transformed target using Theorem 1 and Corollary 2 in [3] leads to values consistent with those obtained using Theorem 2 in Section 3 ($\hat{\ell}^2 = 4$ and AOAR $= 0.205$).

To illustrate these theoretical results, we consider a 20-dimensional normal-normal target distribution as described above and run 50 RWM algorithms that differ by their proposal variance only. For each algorithm, we perform 100,000 iterations and measure efficiency by recording the average squared jumping distance

\[
\mathrm{ASJD} = \frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{n}\left(x_i^{(n)}[j] - x_i^{(n)}[j-1]\right)^2 ;
\]
here, $N$ is the number of iterations and $n$ is the dimension of the target distribution, as before.
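To make the definition concrete, the ASJD of a stored chain can be computed in a few lines. The sketch below is an illustration only (the array `toy` is a made-up three-transition chain, not output from the sampler), written in Python with NumPy.

```python
import numpy as np

def asjd(chain):
    """Average squared jumping distance of a stored chain.

    chain: array of shape (N + 1, n) -- N transitions of an
    n-dimensional Markov chain, one state per row.
    Implements ASJD = (1/N) * sum_j sum_i (x_i[j] - x_i[j-1])^2.
    """
    jumps = np.diff(chain, axis=0)          # x[j] - x[j-1], shape (N, n)
    return float(np.sum(jumps ** 2) / jumps.shape[0])

# Toy 2-dimensional chain with three transitions:
toy = np.array([[0.0, 0.0],
                [1.0, 0.0],   # squared jump: 1
                [1.0, 0.0],   # rejected move: contributes 0
                [2.0, 1.0]])  # squared jump: 1 + 1 = 2
print(asjd(toy))  # -> 1.0  (= (1 + 0 + 2) / 3)
```

Note that a rejected move contributes zero to the sum, which is why the ASJD rewards both frequent acceptance and large accepted steps.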

We also record the average acceptance rate of each algorithm, expressed as

\[
\mathrm{AAR} = \frac{1}{N}\sum_{j=1}^{N} \mathbf 1\{\mathbf x^{(n)}[j] \neq \mathbf x^{(n)}[j-1]\}.
\]
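For illustration, the following minimal sketch runs a plain RWM sampler and records the AAR just defined. It is not the experimental setup of the paper: it uses a much shorter run than 100,000 iterations, a single proposal variance $\ell^2 = 4$ chosen to match the AOS derived above, and the 20-dimensional normal-normal target ($X_1 \sim N(0,1)$, $X_i \mid X_1 \sim N(X_1, 1)$).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20                    # dimension of the normal-normal target
ell2 = 4.0                # proposal scaling; proposal variance is ell2/n per component

def log_target(x):
    # Hierarchical normal-normal log density (up to an additive constant):
    # X1 ~ N(0, 1), Xi | X1 ~ N(X1, 1) for i = 2, ..., n.
    return -0.5 * x[0] ** 2 - 0.5 * np.sum((x[1:] - x[0]) ** 2)

x = np.zeros(n)
accepted = 0
N = 20000
for _ in range(N):
    y = x + rng.normal(scale=np.sqrt(ell2 / n), size=n)  # Gaussian proposal
    if np.log(rng.uniform()) < log_target(y) - log_target(x):
        x = y
        accepted += 1

aar = accepted / N        # the AAR statistic defined above
print(round(aar, 3))
```

With the AOS scaling, the recorded AAR should fall in the vicinity of the theoretical 0.205 quoted above, up to finite-$n$ and Monte Carlo error.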

We repeat these steps for 50- and 100-dimensional normal-normal target densities, and combine all three curves of efficiency versus acceptance rate on a graph along with the theoretical efficiency curve of $\upsilon(\ell)$ versus the expected acceptance rate $\upsilon(\ell)/\ell^2$ (Figure 1, right graph, bottom set of curves). In order to assess the limiting behaviour of the component $X_1$, we also plot the ASJD of this single component (for the 20-, 50-, and 100-dimensional cases) along with the ASJD of the limiting one-dimensional RWM algorithm described in Theorem 1 (Figure 1, left graph, top set of curves). We now repeat the numerical experiment by taking advantage of the known target marginal variances in the tuning of the proposal distribution. Specifically, we let $Y_1 \sim N(x_1, \ell^2/2n)$ be independent of $Y_{2:n} \sim N(x_{2:n}, (\ell^2/n) I_{n-1})$ and run the RWM algorithm in dimensions 20, 50, and 100. The resulting simulated and theoretical efficiency curves are illustrated in Figure 1 (left graph, bottom set of curves; right graph, top set of curves). Although the efficiency curves for $X_1$ are lower when using inhomogeneous proposal variances, this approach still results in a better overall performance (the efficiency curves on the right graph are higher than with homogeneous variances). The optimized theoretical efficiency is 0.974, which corresponds to an AOAR of 0.221. In both cases, the trace plots point towards an exploration of the entire state space in $O(n^2)$ iterations. Despite the fact that the results proven in Theorems 1 and 2 are valid asymptotically, the simulation study yields efficiency curves that are very close together, so the conclusions of the theorems seem applicable in relatively low-dimensional settings. Each set of curves on the graph agrees about the optimal acceptance rates, 0.221 and 0.205 respectively. The shape of the efficiency curve is, however, more important than the exact value of the AOAR.
Tuning the acceptance rate of the algorithm anywhere between 0.15 and 0.3 would yield a loss of at most 10% in efficiency, and would still result in a Markov chain that rapidly explores its state space; in particular, using the usual 0.234 for this target distribution would yield an almost optimal algorithm. Beyond finding the exact AOAR for a specific target distribution, there is thus a need to understand when and why AOARs are significantly different from 0.234 or, in other words, in which cases it becomes necessary to solve for the exact AOAR. At the present time, the only way of answering this question is by solving the optimal scaling problem for the target distributions of interest.
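The AOS value and AOAR reported for this example can be checked numerically. The sketch below is not the author's code (the grid bounds and quadrature step are arbitrary choices); it maximizes $\upsilon(\ell) = 2\ell^2 E_{Z_1}[\Phi(-(\ell/2)\sqrt{Z_1^2+1})]$ over a grid of $\ell$ values using simple quadrature for the expectation.

```python
import math
import numpy as np

nerf = np.vectorize(math.erf)          # elementwise error function
def Phi(t):
    # Standard normal cdf, applied elementwise.
    return 0.5 * (1.0 + nerf(t / math.sqrt(2.0)))

# Quadrature grid for the expectation over Z1 ~ N(0, 1).
z = np.linspace(-8.0, 8.0, 1601)
w = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi) * (z[1] - z[0])

def upsilon(ell):
    # Speed measure: 2 l^2 E[Phi(-(l/2) sqrt(Z1^2 + 1))].
    return 2.0 * ell ** 2 * float(np.sum(Phi(-(ell / 2.0) * np.sqrt(z ** 2 + 1.0)) * w))

ells = np.linspace(1.0, 3.0, 201)
ups = np.array([upsilon(l) for l in ells])
i = int(np.argmax(ups))
ell_hat, aoar = ells[i], ups[i] / ells[i] ** 2
print(round(ell_hat ** 2, 2), round(aoar, 3))  # near 4.00 and 0.205
```

The grid maximum reproduces, up to discretization error, the AOS value $\hat\ell^2 = 4.00$ and AOAR of 0.205 derived in the text.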

4.2 Gamma-normal hierarchical distribution

As a second example, consider a gamma-normal hierarchical target distribution where $X_1 \sim \Gamma(\alpha, \lambda)$ and $X_i | X_1 \sim N(0, 1/X_1)$ for $i = 2, \ldots, n$. Although the $X_i$s are still normally distributed, the component $X_1$ now acts through the variance of the normal random variables. This results in a target distribution that significantly differs from the distribution considered in the previous section, falling slightly outside the framework of Section 2 (the function $\frac{\partial}{\partial x_1}\log f(x|x_1)$ being only locally Lipschitz continuous). We apply the usual RWM algorithm to obtain a sample from this distribution.


Figure 1: Efficiency of the RWM algorithm against acceptance rate for the normal-normal hierarchical target. Left: efficiency of $X_1$ only; the top set of curves corresponds to homogeneous proposal variances. Right: efficiency of all $n$ components; the top set of curves now corresponds to inhomogeneous proposal variances.

It is easy to verify that $X_1 | X_{2:n} \sim \Gamma(\alpha + (n-1)/2,\ \lambda + \sum_{i=2}^{n} X_i^2/2)$; as $n \to \infty$, $\mathrm{Var}(X_1|X_{2:n}) \to_p 0$. The WLLN-type condition of Theorem 1 is satisfied since
\[
\frac{1}{n}\sum_{i=2}^{n}\Big(\mu_n + \frac{\tilde X_1}{\sqrt n}\Big)^{\!2} X_i^2 = \Big(\mu_n + \frac{\tilde X_1}{\sqrt n}\Big)\Big(\frac{1}{n}\sum_{i=2}^{n} Z_i^2\Big)
\]
converges in probability to $\mu(X) = E_X[(\frac{\partial}{\partial X}\log f(X|\mu))^2]$, where $Z_{1:n}$ are independent standard normal random variables. Using Stirling's formula, it is not difficult to show that the density of $\tilde X_1 | X_{2:n}$ converges almost surely to that of a $N(0, 2\mu^2(X))$ distribution. By Theorem 1, the component $X_1$ asymptotically behaves according to an atypical one-dimensional RWM algorithm with a normal target distribution whose variance varies from iteration to iteration, and with an acceptance function as in (4) that varies accordingly. To optimize the efficiency of the algorithm, we analyze the speed measure in (6) which, in the present case, is expressed as
\[
\upsilon(\ell, x_1) = 2\ell^2\, E_{Z_1}\!\left[\Phi\!\left(-\frac{\ell}{2}\sqrt{\frac{Z_1^2}{2x_1^2} + x_1}\right)\right],
\]

where $Z_1 \sim N(0, 1)$. Maximizing the function $E_{X_1}[\upsilon(\ell, X_1)]$ with respect to $\ell$ leads to the global AOS value, which is fixed across iterations; when $(\alpha, \lambda) = (3, 1)$, for instance, we find $\hat\ell^2 = 2.40$ and AOAR $= 0.204$. The simulation study described in Section 4.1 has been performed for the gamma-normal target model with various $\alpha$ and $\lambda$. Specifically, for fixed $\alpha, \lambda$, we consider a 10-dimensional gamma-normal target distribution and run 50 RWM algorithms, each with its own proposal variance. For each algorithm, we perform 1,000,000 iterations and measure efficiency by recording the ASJD of each chain. We then repeat these steps for 20- and 50-dimensional targets. Table 1 presents the optimal efficiency and acceptance rate for various $\alpha, \lambda$. Those

results are compared to the theoretical optimal values obtained by maximizing $E_{X_1}[\upsilon(\ell, X_1)]$. Although the corresponding graphs are omitted here, they yield curves similar to those obtained in Figure 1 for the normal-normal target distribution. The trace plots, however, point towards an exploration of the state space in $O(n^3)$ iterations. We note that even if the gamma-normal departs from a jointly normal distribution assumption and does not yield as nice a target distribution as in the previous example, the AOAR obtained is not too far from the 0.234 found for i.i.d. target distributions. The AOAR however tends to decrease as $\lambda$ increases (0.152 when $(\alpha, \lambda) = (2, 3)$).

Table 1: Optimal efficiency and acceptance rate of chains in various dimensions (n = 10, 20, 50), for different parameters α, λ of the gamma distribution of X1. The theoretical optimal efficiency and acceptance rate are also included for comparison.

                            Optimal efficiency                  Optimal acceptance rate
Parameters        Theoretical  n = 10   n = 20   n = 50   Theoretical  n = 10   n = 20   n = 50
α = 2, λ = 1        0.6381     0.6036   0.6246   0.6456     0.1934     0.1984   0.1968   0.1857
α = 2, λ = 2        0.8169     0.7430   0.7862   0.8239     0.1815     0.1888   0.1759   0.1886
α = 2, λ = 3        0.8420     0.7623   0.8170   0.8608     0.1517     0.1682   0.1527   0.1593
α = 3, λ = 1        0.4889     0.4503   0.4736   0.4926     0.2037     0.2370   0.2158   0.2001
α = 3, λ = 2        0.7541     0.6739   0.7139   0.7405     0.2038     0.2265   0.2233   0.2040
α = 3, λ = 3        0.8648     0.7554   0.8075   0.8497     0.1922     0.1931   0.1930   0.1882
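The theoretical values quoted for $(\alpha, \lambda) = (3, 1)$ can be approximated by Monte Carlo. The sketch below is an illustration under the expression of $\upsilon(\ell, x_1)$ given above (the sample size and grid are arbitrary choices, and this is not the author's code); it maximizes $E_{X_1}[\upsilon(\ell, X_1)]$ over a grid of $\ell$ values.

```python
import math
import numpy as np

nerf = np.vectorize(math.erf)                    # elementwise error function
def Phi(t):
    # Standard normal cdf, applied elementwise.
    return 0.5 * (1.0 + nerf(t / math.sqrt(2.0)))

rng = np.random.default_rng(0)
alpha, lam = 3.0, 1.0
x1 = rng.gamma(alpha, 1.0 / lam, size=40_000)    # X1 ~ Gamma(alpha, rate lambda)
z1 = rng.normal(size=x1.size)                    # Z1 ~ N(0, 1)
root = np.sqrt(z1 ** 2 / (2.0 * x1 ** 2) + x1)   # sqrt(Z1^2/(2 x1^2) + x1)

ells = np.linspace(1.2, 2.0, 41)
ups = np.array([2.0 * l ** 2 * float(Phi(-(l / 2.0) * root).mean()) for l in ells])
i = int(np.argmax(ups))
ell2_hat, aoar = ells[i] ** 2, ups[i] / ells[i] ** 2
print(round(ell2_hat, 2), round(aoar, 3), round(float(ups[i]), 3))
# expect values near 2.4, 0.204, and 0.489 (first row for alpha = 3 in Table 1)
```

Up to Monte Carlo and grid error, the maximizer reproduces $\hat\ell^2 = 2.40$, the AOAR of 0.204, and the theoretical optimal efficiency 0.4889 reported in Table 1.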

5 Discussion

In this paper, we have studied the tuning of RWM algorithms applied to single-level hierarchical target distributions. The optimal variance of the Gaussian proposal distribution has been found to depend on a measure of roughness of the density $f$ with respect to $x$, as before, but also with respect to the mixing parameter $x_1$. This leads to local proposal variances that are a function of the mixing parameter $x_1$. It is however possible to average over the random variable $X_1$ to find a globally optimal proposal variance. In the case where $X_1$ is a location parameter, it does not affect the roughness of the density $f$, and the optimal proposal scaling is valid both locally and globally. Although not of direct use here, we expect the concept of locally optimal proposal variances to be of interest with other types of samplers, such as RWM-within-Gibbs algorithms. Indeed, it is worth pointing out that the specific asymptotic results obtained provide a theoretical justification for the use of RWM-within-Gibbs when sampling from hierarchical targets. Higher-level hierarchies could be studied using a similar approach. A target featuring $p$ mixing components, expressed as

\[
\pi(\mathbf x) = \prod_{j=1}^{p} f_j(x_j) \prod_{i=p+1}^{n} f(x_i \mid x_{1:p}),
\]

with $x_{1:p} = (x_1, \ldots, x_p)$, would lead to a result similar to Theorem 2, but with the function

\[
\gamma(x_{1:p}, z_{1:p}) = E_X\!\left[\left(\sum_{j=1}^{p} z_j \frac{\partial}{\partial x_j}\log f(X \mid x_{1:p})\right)^{\!2}\right] + E_X\!\left[\left(\frac{\partial}{\partial X}\log f(X \mid x_{1:p})\right)^{\!2}\right],
\]
where $z_{1:p} = (z_1, \ldots, z_p)$ come from independent $N(0, 1)$ random variables. For a target whose mixing component ($X_p$, say) depends itself on higher-level mixing components $X_1, \ldots, X_{p-1}$, i.e. $\pi(\mathbf x) = f_1(x_{1:p}) \prod_{i=p+1}^{n} f(x_i \mid x_p)$, the conclusions of Theorem 2 are still valid. These generalizations also hold for Corollary 7, with obvious adjustments ($Z_1 \sim N(0, \ell^2\kappa_1^2/n), \ldots, Z_p \sim N(0, \ell^2\kappa_p^2/n)$). Similar extensions may be derived for other hierarchical models.

In the simple examples tested, we have found that the optimal acceptance rate most often lies around 0.2. In the gamma-normal example, some values of $\alpha, \lambda$ led to significantly lower optimal acceptance rates (0.15 when $\alpha = 2$, $\lambda = 3$). The usual 0.234 is thus quite robust and, if preferred, should lead to an efficient version of the sampler. In the case of correlated targets, it would however be wiser to settle for an acceptance rate slightly below 0.234. Since we investigate correlated targets with a proposal distribution featuring a diagonal covariance matrix, it is not surprising to find an AOAR lower than 0.234; the latter is the AOAR for exploring a target distribution with independent components, which is an ideal situation when relying on a proposal distribution with independent components.
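As a consistency check (a direct computation, not spelled out in the text), note that for $p = 1$ and the normal-normal target of Section 4.1, where $f(x \mid x_1)$ is the $N(x_1, 1)$ density, the generalized function above reduces to the $\gamma(x_1, z_1) = z_1^2 + 1$ used in that example:

```latex
\gamma(x_1, z_1)
  = E_X\!\left[\Big(z_1\,\tfrac{\partial}{\partial x_1}\log f(X \mid x_1)\Big)^{2}\right]
    + E_X\!\left[\Big(\tfrac{\partial}{\partial X}\log f(X \mid x_1)\Big)^{2}\right]
  = z_1^2\, E_X\!\big[(X - x_1)^2\big] + E_X\!\big[(X - x_1)^2\big]
  = z_1^2 + 1,
```

since both log-derivatives of the $N(x_1, 1)$ density equal $\pm(X - x_1)$ and $X - x_1$ has unit variance.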

Acknowledgements

The author has received the support of the Natural Sciences and Engineering Research Council of Canada for pursuing this research project, and wishes to thank P. Gagnon for useful comments that improved the manuscript.

References

[1] Bédard, M. (2006). On the Robustness of Optimal Scaling for Random Walk Metropolis Algorithms. Ph.D. dissertation, Department of Statistics, University of Toronto.

[2] Bédard, M. (2007). Weak convergence of Metropolis algorithms for non-iid target distributions. Ann. Appl. Probab. 17, 1222-1244.

[3] Bédard, M. (2008). Optimal acceptance rates for Metropolis algorithms: Moving beyond 0.234. Stochastic Process. Appl. 118, 2198-2222.

[4] Bédard, M., Douc, R., and Moulines, E. (2012). Scaling analysis of multiple-try MCMC methods. Stochastic Process. Appl. 122, 758-786.

[5] Beskos, A., Pillai, N.S., Roberts, G.O., Sanz-Serna, J., and Stuart, A.M. (2013). Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli. 19, 1501-1534.

[6] Beskos, A., Roberts, G.O., and Stuart, A.M. (2009). Optimal scalings for local Metropolis-Hastings chains on non-product targets in high dimensions. Ann. Appl. Probab. 19, 863-898.

[7] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons.

[8] Ethier, S.N. and Kurtz, T.G. (1986). Markov Processes: Characterization and Conver- gence. Wiley.

[9] Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 57, 97-109.

[10] Mattingly, J.C., Pillai, N.S., and Stuart, A.M. (2012). Diffusion limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Probab. 22, 881-930.

[11] Mengersen, K. and Tweedie, R.L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, 101-121.

[12] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087- 1092.

[13] Neal, P. and Roberts, G.O. (2006). Optimal scaling for partially updating MCMC algo- rithms. Ann. Appl. Probab. 16, 475-515.

[14] Neath, R.C. and Jones, G.L. (2009). Variable-at-a-time implementations of Metropolis- Hastings. arXiv:0903.0664v1

[15] Roberts, G.O., Gelman, A., and Gilks, W.R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7, 110-120.

[16] Roberts, G.O. and Rosenthal, J.S. (2001). Optimal scaling for various Metropolis- Hastings algorithms. Statist. Sci. 16, 351-367.

[17] Roberts, G.O. and Tweedie, R.L. (1996). Geometric convergence and central limit theo- rems for multidimensional Hastings and Metropolis algorithms. Biometrika. 83, 95-110.

[18] Sherlock, C. and Roberts, G.O. (2009). Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli. 15, 774-798.

[19] Xifara, T., Sherlock, C., Livingstone, S., Byrne, S., and Girolami, M. (2014). Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 91, 14-19.

A Appendix : Proofs of theorems

We now proceed to prove Theorems 1 and 2. To assess weak convergence of the processes $\{\tilde W_1^{(n)}(t); t \ge 0\}$ and $\{W_2^{(n)}(t); t \ge 0\}$ in the Skorokhod topology, we first verify weak convergence of finite-dimensional distributions. Whereas these processes are not themselves Markovian, they are $\mathcal F^{\tilde{\mathbf W}^{(n)}}(t)$-progressive and $\mathcal F^{(W_1^{(n)}, W_i^{(n)})}(t)$-progressive $\mathbb R$-valued processes, respectively, and the aim of this section is to establish their convergence to some Markov processes. According to Theorem 8.2 of Chapter 4 in [8], we thus look at the pseudo-generator of $\{\tilde W_1^{(n)}(t); t \ge 0\}$ (resp. $\{W_2^{(n)}(t); t \ge 0\}$), the univariate process associated to the component $X_1$ (resp. $X_2$) in the rescaled RWM algorithm introduced at the end of Section 2. We then verify $L^1$-convergence to the generator of the special RWM sampler with acceptance rule (4) (resp. the generator of the diffusion in (5)). To complete the proofs, Theorem 7.8 of Chapter 3 in [8] says that we must also assess the relative compactness of $\{\tilde W_1^{(n)}(t); t \ge 0\}$ and $\{W_2^{(n)}(t); t \ge 0\}$ for $n = 2, 3, \ldots$, as well as the existence of a countable dense set on which the finite-dimensional distributions weakly converge. This is achieved by using Corollary 8.6 of Chapter 4 in [8]; in the setting of Theorem 1, satisfaction of the applicability conditions is immediate; in the setting of Theorem 2, satisfaction of the first condition is immediate, while the verification of the second condition is briefly discussed in Section A.2.

A.1 Proof of Theorem 1

In Theorem 1, it is assumed that $\{\tilde W_1^{(n)}(t); t \ge 0\}$ is the component of interest in $\{\tilde{\mathbf W}^{(n)}(t); t \ge 0\}$. Define the pseudo-generator of $\{\tilde W_1^{(n)}(t); t \ge 0\}$ as
\[
\tilde G_n h(\tilde W_1^{(n)}(t)) = E\left[ h(\tilde W_1^{(n)}(t + 1)) - h(\tilde W_1^{(n)}(t)) \,\Big|\, \mathcal F^{\tilde{\mathbf W}^{(n)}}(t)\right],
\]
where $h$ is an arbitrary test function. By setting $\xi_n(t) = h(\tilde W_1^{(n)}(t))$ and $\varphi_n(t) = \tilde G_n h(\tilde W_1^{(n)}(t))$, the conditions in part (c) of Theorem 8.2 (Chap. 4 in [8]) reduce to $E[|\tilde G_n h(\tilde W_1^{(n)}(t)) - \tilde G h(\tilde W_1(t))|] \to 0$ as $n \to \infty$ for $h \in \bar C$ (the space of continuous and bounded functions on $\mathbb R$), where $\tilde G h(\tilde W_1(t))$ is the generator of the special RWM sampler described in Theorem 1. The above may be reexpressed as $E[|\tilde G_n h(\tilde X_1) - \tilde G h(\tilde X_1)|] \to 0$ as $n \to \infty$, where
\[
\tilde G_n h(\tilde x_1) = E_{\tilde Y_1}\left[\big(h(\tilde Y_1) - h(\tilde x_1)\big)\, E_{Y_{2:n}}\big[\alpha(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)})\big]\right]
\]
with $\tilde{\mathbf x}^{(n)} = (\tilde x_1, x_2, \ldots, x_n)$ and similarly for $\tilde{\mathbf Y}^{(n)}$; the density of $\tilde x_1$ is $\frac{1}{\sqrt n}\,\pi(\mu_n + \frac{\tilde x_1}{\sqrt n}, x_{2:n})$ with $\pi$ as in (1), and therefore $\alpha(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)}) = 1 \wedge \frac{\pi(\mu_n + \tilde Y_1/\sqrt n,\ Y_{2:n})}{\pi(\mu_n + \tilde x_1/\sqrt n,\ x_{2:n})}$. Furthermore,
\[
\tilde G h(\tilde x_1) = E_{\tilde Y_1}\left[\big(h(\tilde Y_1) - h(\tilde x_1)\big)\, \alpha^*(\tilde x_1, \tilde Y_1 \mid \mathbf x)\right]
\]
with $\alpha^*$ as in (4). Note that there is a slight abuse of notation as, although $h$ is a function of $\tilde x_1$ only, the generator $\tilde G_n h(\tilde x_1)$ is a function of $\tilde{\mathbf x}^{(n)}$; a similar remark holds for $\tilde G h(\tilde x_1)$. We now proceed to verify this condition. Hereafter, we use $\to_{a.s.}$, $\to_p$, and $\to_d$ to denote convergence almost surely, in probability, and in distribution. In the current context where there is no time-rescaling factor, the limiting process shall remain a RWM algorithm. For $h \in \bar C$ and some $K > 0$, the triangle inequality implies
\[
E\big[\big|\tilde G_n h(\tilde X_1) - \tilde G h(\tilde X_1)\big|\big] \le K\, E\big[\big|\alpha(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) - \alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\big|\big] \qquad (A.1)
\]
\[
\quad + K\, E\big[\big|\alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) - \alpha_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\big|\big] + K\, E\Big[\Big| E_{Y_{2:n}}\big[\alpha_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\big] - \alpha^*(\tilde X_1, \tilde Y_1 \mid \mathbf X)\Big|\Big],
\]

where the function $\alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})$ shall be defined in Lemma B.1 and $\alpha_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) = 1 \wedge \exp\{\varepsilon_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\}$. Here,
\[
\varepsilon_1(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)}) = \log\frac{f_1(\mu_n + \frac{\tilde Y_1}{\sqrt n} \mid x_{2:n})}{f_1(\mu_n + \frac{\tilde x_1}{\sqrt n} \mid x_{2:n})} + \sum_{i=2}^{n}\frac{\partial}{\partial x}\log f(x \mid \mu_n + \tfrac{\tilde x_1}{\sqrt n})\Big|_{x = x_i}(Y_i - x_i) - \frac{\ell^2}{2n}\sum_{i=2}^{n}\left(\frac{\partial}{\partial x}\log f(x \mid \mu_n + \tfrac{\tilde x_1}{\sqrt n})\Big|_{x = x_i}\right)^{\!2}, \qquad (A.2)
\]
with $\frac{1}{\sqrt n} f_1(\mu_n + \frac{\tilde x_1}{\sqrt n} \mid x_{2:n})$ representing the conditional density of $\tilde X_1$ given $x_{2:n}$. By Lemmas B.1 and B.2, the first and second terms in (A.1) respectively converge to 0 as $n \to \infty$; in the sequel, we thus study the last term. Since $Y_{2:n} \sim N(x_{2:n}, \ell^2 I_{n-1}/n)$, the second and third terms on the right of (A.2) are normally distributed with mean $M$ and variance $V$, where
\[
V = -2M = \frac{\ell^2}{n}\sum_{i=2}^{n}\left(\frac{\partial}{\partial x}\log f(x \mid \mu_n + \tfrac{\tilde x_1}{\sqrt n})\Big|_{x = x_i}\right)^{\!2}.
\]
By assumption, this variance term converges in probability to $\ell^2\tilde\gamma(\mu)$; hence, the last two terms on the right of (A.2) converge in probability to a $N(-\ell^2\tilde\gamma(\mu)/2,\ \ell^2\tilde\gamma(\mu))$ random variable. Regularity conditions allow us to invoke the (multivariate) Continuous Mapping Theorem, which implies
\[
\alpha_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) \to_p 1 \wedge \exp\left\{ N\!\left( \log\frac{g_1(\tilde Y_1 \mid \mathbf X)}{g_1(\tilde X_1 \mid \mathbf X)} - \frac{\ell^2}{2}\tilde\gamma(\mu),\ \ell^2\tilde\gamma(\mu) \right)\right\}.
\]
Proposition 2.4 in [15] then claims that the expectation of $1 \wedge \exp\{Z\}$, where $Z$ is the normal random variable just introduced, is equal to $\alpha^*(\tilde X_1, \tilde Y_1 \mid \mathbf X)$. The Bounded Convergence Theorem can then be used to conclude that the last term in (A.1) converges to 0 as $n \to \infty$.

A.2 Proof of Theorem 2

In Theorem 2, it is assumed that $\{W_i^{(n)}(t); t \ge 0\}$ ($i = 2, \ldots, n$) is the component of interest in the rescaled process $\{\mathbf W^{(n)}(t); t \ge 0\}$. Without loss of generality, fix $i = 2$ and define the pseudo-generator of $\{W_2^{(n)}(t); t \ge 0\}$ as
\[
G_n h(W_2^{(n)}(t)) = n\, E\left[ h(W_2^{(n)}(t + \tfrac{1}{n})) - h(W_2^{(n)}(t)) \,\Big|\, \mathcal F^{(W_1^{(n)}, W_2^{(n)})}(t)\right],
\]
where $h$ is an arbitrary test function. By setting $\xi_n(t) = h(W_2^{(n)}(t))$ and $\varphi_n(t) = G_n h(W_2^{(n)}(t))$, part (c) of Theorem 8.2 (Chapter 4 in [8]) reduces to the conditions $\sup_n \sup_{s \le T} E[|G_n h(W_2^{(n)}(s))|] < \infty$ for $T > 0$ and $h \in \bar C$, and $E[|G_n h(W_2^{(n)}(t)) - G h(W_2(t))|] \to 0$ as $n \to \infty$ for $h \in \bar C$, where $G h(W_2(t))$ is the generator of the diffusion process described in Theorem 2.

Hereafter, we use the notation $Y_{1,3:n} = (Y_1, Y_3, \ldots, Y_n)$. The latter condition may be reexpressed as $E_{X_{1:2}}[|G_n h(X_2) - G h(X_2)|] \to 0$ as $n \to \infty$, where
\[
G_n h(X_2) = n\, E_{Y_2}\Big[\big(h(Y_2) - h(X_2)\big)\, E_{X_{3:n}, Y_{1,3:n}}\big[\alpha(\mathbf X^{(n)}, \mathbf Y^{(n)})\big]\Big] \qquad (A.3)
\]
with $\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = 1 \wedge \frac{\pi(\mathbf Y^{(n)})}{\pi(\mathbf x^{(n)})}$ and $\pi$ as in (1), and
\[
G h(X_2) = \upsilon(\ell, X_1)\left[\frac{1}{2} h''(X_2) + \frac{1}{2}\,\frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\, h'(X_2)\right]. \qquad (A.4)
\]

There is again a slight abuse of notation as, although $h$ is a function of $x_2$ only, the generators $G_n h(x_2)$ and $G h(x_2)$ are functions of $x_1, x_2$. Due to the form of (A.4), we can resort to Theorem 2.1 of Chapter 8 in [8] to assert that $C_c^\infty$, the space of continuous and infinitely differentiable functions that are compactly supported on $\mathbb R$, forms a core for the generator of the diffusion in Theorem 2. The test function $h$ in (A.3) and (A.4) may then be restricted to functions belonging to $C_c^\infty$. We note that the condition $\sup_n \sup_{s \le T} E[|G_n h(W_2^{(n)}(s))|] < \infty$ for $T > 0$ and $h \in \bar C$ may be reexpressed as $\sup_n E[|G_n h(X_2)|] < \infty$ for $h \in C_c^\infty$. In fact, it is straightforward to verify that $E[(G_n h(X_2))^2] \le K_h + O(n^{-1})$ for some $K_h \in (0, \infty)$, which implies that the former is satisfied (this is achieved by considering a function similar to (B.6), in which the acceptance function is Taylor expanded to first order only, and by proceeding as in the proof of Lemma B.3). It also implies the satisfaction of the second applicability condition of Corollary 8.6 (Chapter 4 in [8]), which may be simplified as $\limsup_{n \to \infty} E[(G_n h(X_2))^2] < \infty$ for $h \in C_c^\infty$. We now proceed to verify that $G_n h(X_2)$ converges in $L^1$ to $G h(X_2)$. To begin, we have from Lemma B.3 that $E_{X_{1:2}}[|G_n h(X_2) - G_n^{(1)} h(X_2)|] \to 0$ as $n \to \infty$, where $G_n^{(1)} h(x_2)$ is the generator of a diffusive process:
\[
G_n^{(1)} h(X_2) = \frac{\ell^2}{2}\, h''(X_2)\, E_{X_{3:n}, Y_{1,3:n}}\big[\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big] + \ell^2\, h'(X_2)\, \frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\, E_{X_{3:n}, Y_{1,3:n}}\big[g(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big]. \qquad (A.5)
\]
Note that it is not necessary to make this expression more precise at this stage. Now, we have
\[
E_{X_{1:2}}\big[\big|G_n^{(1)} h(X_2) - G h(X_2)\big|\big] \le K_1\, E_{X_{1:2}}\Big[\Big| \ell^2 E_{X_{3:n}, Y_{1,3:n}}\big[\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big] - \upsilon(\ell, X_1) \Big|\Big] \qquad (A.6)
\]
\[
\quad + K_2\, E_{X_{1:2}}\left[ \left|\frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\right| \cdot \Big| \ell^2 E_{X_{3:n}, Y_{1,3:n}}\big[g(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big] - \tfrac{1}{2}\upsilon(\ell, X_1) \Big| \right]
\]

for some $K_1, K_2 > 0$, since $h \in C_c^\infty$ and thus $|h'|$ and $|h''|$ are bounded.

Using the triangle inequality, the first term on the RHS of (A.6) satisfies
\[
E_{X_{1:2}}\Big[\Big| \ell^2 E_{X_{3:n}, Y_{1,3:n}}\big[\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big] - \upsilon(\ell, X_1) \Big|\Big] \le \ell^2\, E_{X_{1:n}, Y_{1,3:n}}\Big[\big|\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)}) - \hat\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\big|\, \mathbf 1_{\mathcal X_1}(Y_1)\Big]
\]
\[
\quad + E_{X_{1:2}}\Big[\Big| \ell^2 E_{X_{3:n}, Y_{1,3:n}}\big[\hat\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big] - \upsilon(\ell, X_1) \Big|\Big],
\]
where the function $\hat\alpha$ is as in Lemma B.4. Using Lemmas B.4 and B.5, the above converges to 0 as $n \to \infty$. It thus only remains to verify that the second term on the right hand side of (A.6) also converges to 0; Lemma B.6 leads us to that conclusion.

B Appendix : Intermediate results

Lemma B.1. As $n \to \infty$, we have $E[|\alpha(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) - \alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})|] \to 0$, with $\alpha$ as in Appendix A.1 and $\alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) = 1 \wedge \exp\{\varepsilon_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\}$, with
\[
\varepsilon_2(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)}) = \log\frac{f_1(\mu_n + \frac{\tilde Y_1}{\sqrt n} \mid x_{2:n})}{f_1(\mu_n + \frac{\tilde x_1}{\sqrt n} \mid x_{2:n})} + \sum_{i=2}^{n}\frac{\partial}{\partial x}\log f(x \mid \mu_n + \tfrac{\tilde Y_1}{\sqrt n})\Big|_{x = x_i}(Y_i - x_i) - \frac{\ell^2}{2n}\sum_{i=2}^{n}\left(\frac{\partial}{\partial x}\log f(x \mid \mu_n + \tfrac{\tilde Y_1}{\sqrt n})\Big|_{x = x_i}\right)^{\!2}. \qquad (B.1)
\]

Proof. The acceptance function satisfies $\alpha(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)}) = 1 \wedge \exp\{\varepsilon(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)})\}$, where
\[
\varepsilon(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)}) = \log\frac{f_1(\mu_n + \frac{\tilde Y_1}{\sqrt n})\, \prod_{i=2}^{n} f(x_i \mid \mu_n + \frac{\tilde Y_1}{\sqrt n})}{f_1(\mu_n + \frac{\tilde x_1}{\sqrt n})\, \prod_{i=2}^{n} f(x_i \mid \mu_n + \frac{\tilde x_1}{\sqrt n})} + \sum_{i=2}^{n}\log\frac{f(Y_i \mid \mu_n + \frac{\tilde Y_1}{\sqrt n})}{f(x_i \mid \mu_n + \frac{\tilde Y_1}{\sqrt n})}.
\]
Applying obvious changes of variables allows us to express $\varepsilon$ in terms of $\mathbf x^{(n)}$ and $\mathbf Y^{(n)}$:
\[
\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \log\frac{f_1(Y_1 \mid x_{2:n})}{f_1(x_1 \mid x_{2:n})} + \sum_{i=2}^{n}\big(\log f(Y_i \mid Y_1) - \log f(x_i \mid Y_1)\big).
\]

Using a second-order Taylor expansion with respect to $Y_i$ around $x_i$ ($i = 2, \ldots, n$) to reexpress the last term on the right hand side (RHS) leads to
\[
\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \log\frac{f_1(Y_1 \mid x_{2:n})}{f_1(x_1 \mid x_{2:n})} + \sum_{i=2}^{n}\frac{\partial}{\partial x_i}\log f(x_i \mid Y_1)(Y_i - x_i) + \frac{1}{2}\sum_{i=2}^{n}\frac{\partial^2}{\partial U_i^2}\log f(U_i \mid Y_1)(Y_i - x_i)^2
\]
for some $U_i \in (x_i, Y_i)$ or $U_i \in (Y_i, x_i)$.

We note that a candidate $Y_1$ that does not belong to $\mathcal X_1$ is automatically rejected by the algorithm, i.e. the functions $\alpha$, $\alpha_2$, $\alpha_1$, and $\alpha^*$ are identically 0. Applying changes of variables to the function $\varepsilon_2(\tilde{\mathbf x}^{(n)}, \tilde{\mathbf Y}^{(n)})$ and using the Lipschitz property of $1 \wedge \exp\{\cdot\}$ along with the fact that $Y_i \sim N(x_i, \ell^2/n)$, $i = 2, \ldots, n$, yield
\[
E\big[\big|\alpha(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) - \alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\big|\big] \le E\big[\big|\varepsilon(\mathbf X^{(n)}, \mathbf Y^{(n)}) - \varepsilon_2(\mathbf X^{(n)}, \mathbf Y^{(n)})\big|\, \mathbf 1_{\mathcal X_1}(Y_1)\big]
\]
\[
\le E\left[\left| \frac{1}{2}\sum_{i=2}^{n}\frac{\partial^2}{\partial X_i^2}\log f(X_i \mid Y_1)(Y_i - X_i)^2 + \frac{\ell^2}{2n}\sum_{i=2}^{n}\left(\frac{\partial}{\partial X_i}\log f(X_i \mid Y_1)\right)^{\!2}\right| \mathbf 1_{\mathcal X_1}(Y_1)\right]
\]
\[
\quad + \frac{\ell^2}{2}\,\frac{n-1}{n}\, E\left[\left| \frac{\partial^2}{\partial U_2^2}\log f(U_2 \mid Y_1) - \frac{\partial^2}{\partial X_2^2}\log f(X_2 \mid Y_1)\right| Z_2^2\, \mathbf 1_{\mathcal X_1}(Y_1)\right],
\]
where $Z_2 = \sqrt n (Y_2 - X_2)/\ell \sim N(0, 1)$, and $\mathbf 1_{\mathcal X_1}(y) = 1$ if $y \in \mathcal X_1$ and 0 otherwise. From Proposition C.1 in Appendix C, the first term on the RHS converges to 0 as $n \to \infty$. We now study the second term on the right. Since $Y_2 \to_{a.s.} x_2$, it implies that $U_2 \to_{a.s.} x_2$; from the Continuous Mapping Theorem, we have $\frac{\partial^2}{\partial U_2^2}\log f(U_2 \mid Y_1) - \frac{\partial^2}{\partial X_2^2}\log f(X_2 \mid Y_1) \to_{a.s.} 0$ for all $Y_1 \in \mathcal X_1$. Furthermore,
\[
E\left[\left(\frac{\partial^2}{\partial U_2^2}\log f(U_2 \mid Y_1) - \frac{\partial^2}{\partial X_2^2}\log f(X_2 \mid Y_1)\right)^{\!2} Z_2^4\, \mathbf 1_{\mathcal X_1}(Y_1)\right] \le 12\, E[K^2(Y_1)\, \mathbf 1_{\mathcal X_1}(Y_1)]
\]
\[
\le 24\, E[(K(Y_1) - K(X_1))^2\, \mathbf 1_{\mathcal X_1}(Y_1)] + 24\, E[K^2(X_1)] \le 24 K^* \frac{\ell^2}{n} + 24\, E[K^2(X_1)] < \infty
\]
for some $K^* > 0$ (since $K(x_1)$ satisfies a Lipschitz condition). We conclude, by invoking the Uniform Integrability Theorem, that the second term converges to 0 as $n \to \infty$.

Lemma B.2. As $n \to \infty$, we have $E[|\alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) - \alpha_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})|] \to 0$, with $\alpha_1$ as in Appendix A.1 and $\alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})$ as in Lemma B.1.

Proof. Applying obvious changes of variables to $\alpha_1, \alpha_2$ and using the Lipschitz property of $1 \wedge \exp\{\cdot\}$ yield
\[
E\big[\big|\alpha_2(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)}) - \alpha_1(\tilde{\mathbf X}^{(n)}, \tilde{\mathbf Y}^{(n)})\big|\big] \le E\big[\big|\varepsilon_2(\mathbf X^{(n)}, \mathbf Y^{(n)}) - \varepsilon_1(\mathbf X^{(n)}, \mathbf Y^{(n)})\big|\, \mathbf 1_{\mathcal X_1}(Y_1)\big]
\]
\[
\le E\left[\left|\sum_{i=2}^{n}\left(\frac{\partial}{\partial X_i}\log f(X_i \mid Y_1) - \frac{\partial}{\partial X_i}\log f(X_i \mid X_1)\right)(Y_i - X_i)\right| \mathbf 1_{\mathcal X_1}(Y_1)\right] \qquad (B.2)
\]
\[
\quad + \frac{\ell^2}{2}\,\frac{n-1}{n}\, E\left[\left|\left(\frac{\partial}{\partial X_i}\log f(X_i \mid Y_1)\right)^{\!2} - \left(\frac{\partial}{\partial X_i}\log f(X_i \mid X_1)\right)^{\!2}\right| \mathbf 1_{\mathcal X_1}(Y_1)\right].
\]
The summation in (B.2) is distributed as a normal random variable with null mean and variance $\frac{\ell^2}{n}\sum_{i=2}^{n}\big(\frac{\partial}{\partial X_i}\log f(X_i \mid Y_1) - \frac{\partial}{\partial X_i}\log f(X_i \mid X_1)\big)^2$. Using Hölder's inequality, the corresponding expectation is bounded by
\[
\ell \left( \frac{n-1}{n}\, E\left[\left(\frac{\partial}{\partial X_i}\log f(X_i \mid Y_1) - \frac{\partial}{\partial X_i}\log f(X_i \mid X_1)\right)^{\!2} \mathbf 1_{\mathcal X_1}(Y_1)\right] \right)^{\!1/2}. \qquad (B.3)
\]

Since $Y_1 \to_{a.s.} x_1$, we use the Continuous Mapping Theorem to affirm that the integrand converges to 0 almost surely. By assumption, we know that $E[(\frac{\partial}{\partial X_i}\log f(X_i \mid X_1))^4] < \infty$. From the proof of Proposition C.1, we also know that $E[(\frac{\partial}{\partial X_i}\log f(X_i \mid Y_1))^4\, \mathbf 1_{\mathcal X_1}(Y_1)] < \infty$. We can thus use the Uniform Integrability Theorem to deduce that the expectation in (B.3) converges to 0 as $n \to \infty$. The exact same arguments may be used to conclude that the last term in (B.2) converges to 0 as $n \to \infty$.

Lemma B.3. As $n \to \infty$, we have $E_{X_{1:2}}[|G_n h(X_2) - G_n^{(1)} h(X_2)|] \to 0$, where $G_n h(X_2)$ and $G_n^{(1)} h(X_2)$ are in (A.3) and (A.5) respectively, with $\mathbf Y_{x_2}^{(n)} = (Y_1, x_2, Y_3, \ldots, Y_n)$,
\[
g(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \exp\{\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)})\}\, \mathbf 1\big\{\exp\{\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)})\} < 1\big\}, \qquad (B.4)
\]
and
\[
\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \log\frac{f_1(Y_1)}{f_1(x_1)} + \log\frac{f(Y_2 \mid Y_1)}{f(x_2 \mid x_1)} + \sum_{i=3}^{n}\big(\log f(Y_i \mid Y_1) - \log f(x_i \mid x_1)\big). \qquad (B.5)
\]

Proof. The acceptance rule in (A.3) may be written $\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = 1 \wedge \exp\{\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)})\}$, where the candidates are generated according to $\mathbf Y^{(n)} \sim N(\mathbf x^{(n)}, \ell^2 I_n/n)$. We note that a candidate $Y_1 \notin \mathcal X_1$ is automatically rejected by the algorithm, and thus corresponds to an acceptance probability that is null. It thus does not cause any problem to express the acceptance function as $\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)$ wherever necessary. We first Taylor expand the acceptance function $\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = 1 \wedge \exp\{\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)})\}$ with respect to $Y_2$ around $x_2$. As argued in [13], this function is not everywhere differentiable. However, the points $(\mathbf x^{(n)}, \mathbf y^{(n)})$ at which the derivatives do not exist have a Lebesgue measure that is either null or converging exponentially to 0 as $n \to \infty$; hence this shall not cause any concern when considering expectations of generators. (The latter may happen if $f_1$ and $f$ are constant over some interval of the state space, for instance, in which case we could have $P(\pi(\mathbf Y^{(n)}) = \pi(\mathbf x^{(n)})) > 0$. The occurrence of such values $\mathbf x^{(n)}$ however has a probability converging exponentially rapidly to 0 as $n \to \infty$.) The first-order derivative of $\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)})$ with respect to $Y_2$ is given by
\[
\frac{\partial}{\partial Y_2}\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \frac{\partial}{\partial Y_2}\log f(Y_2 \mid Y_1)\, g(\mathbf x^{(n)}, \mathbf Y^{(n)}),
\]
where the function $g$ is as in (B.4); the second-order derivative is expressed as
\[
\frac{\partial^2}{\partial Y_2^2}\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \left[\frac{\partial^2}{\partial Y_2^2}\log f(Y_2 \mid Y_1) + \left(\frac{\partial}{\partial Y_2}\log f(Y_2 \mid Y_1)\right)^{\!2}\right] g(\mathbf x^{(n)}, \mathbf Y^{(n)}).
\]

The generator in (A.3) is thus developed as
\[
G_n h(X_2) = n\, E_{Y_2}\big[h(Y_2) - h(X_2)\big]\, E_{X_{3:n}, Y_{1,3:n}}\big[\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big]
\]
\[
\quad + n\, E_{Y_2}\big[(h(Y_2) - h(X_2))(Y_2 - X_2)\big]\, E_{X_{3:n}, Y_{1,3:n}}\!\left[\frac{\partial}{\partial Y_2}\alpha(\mathbf X^{(n)}, \mathbf Y^{(n)})\Big|_{Y_2 = X_2} \mathbf 1_{\mathcal X_1}(Y_1)\right] + R_n(X_{1:2}, U_2), \qquad (B.6)
\]
where
\[
R_n(X_{1:2}, U_2) = \frac{n}{2}\, E_{Y_2}\!\left[(h(Y_2) - h(X_2))(Y_2 - X_2)^2\, E_{X_{3:n}, Y_{1,3:n}}\!\left[\frac{\partial^2}{\partial U_2^2}\alpha(\mathbf X^{(n)}, \mathbf Y_{U_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\right]\right] \qquad (B.7)
\]
for some $U_2 \in (X_2, Y_2)$ or $U_2 \in (Y_2, X_2)$. This leads to
\[
E_{X_{1:2}}\big[\big|G_n h(X_2) - G_n^{(1)} h(X_2)\big|\big] \le E[|R_n(X_{1:2}, U_2)|]
\]
\[
\quad + E_{X_{1:2}}\!\left[\left| n\, E_{Y_2}[h(Y_2) - h(X_2)] - \frac{\ell^2}{2} h''(X_2) \right| E_{X_{3:n}, Y_{1,3:n}}\big[\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big]\right] \qquad (B.8)
\]
\[
\quad + E_{X_{1:2}}\!\left[\left| n\, E_{Y_2}\big[(h(Y_2) - h(X_2))(Y_2 - X_2)\big]\, E_{X_{3:n}, Y_{1,3:n}}\!\left[\frac{\partial}{\partial Y_2}\alpha(\mathbf X^{(n)}, \mathbf Y^{(n)})\Big|_{Y_2 = X_2} \mathbf 1_{\mathcal X_1}(Y_1)\right] - \ell^2 h'(X_2)\, \frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\, E_{X_{3:n}, Y_{1,3:n}}\big[g(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big] \right|\right].
\]

The remainder term in (B.7) converges to 0 in $L^1$, as now detailed. By using a first-order Taylor expansion of $h$ with respect to $Y_2$ around $x_2$ along with the fact that $h \in C_c^\infty$, it follows that $|h(Y_2) - h(x_2)| \le K_1 |Y_2 - x_2|$ for some $K_1 > 0$. Furthermore, since $\frac{\partial}{\partial x_2}\log f(x_2 \mid x_1)$ is Lipschitz continuous on $\mathbb R$ for all fixed $x_1 \in \mathcal X_1$, then $|\frac{\partial^2}{\partial x_2^2}\log f(x_2 \mid x_1)| \le K(x_1)$. Using the fact that the function $g$ in (B.4) is bounded by 1, we then write
\[
E[|R_n(X_{1:2}, U_2)|] \le K_1 \sqrt{\frac{2}{\pi}}\, \frac{\ell^3}{\sqrt n}\, E[K(Y_1)\, \mathbf 1_{\mathcal X_1}(Y_1)] + K_1\, \frac{n}{2}\, E\!\left[|Y_2 - X_2|^3 \left(\frac{\partial}{\partial U_2}\log f(U_2 \mid Y_1)\right)^{\!2} \mathbf 1_{\mathcal X_1}(Y_1)\right].
\]
Since $|\frac{\partial}{\partial U_2}\log f(U_2 \mid Y_1)| \le |\frac{\partial}{\partial x_2}\log f(x_2 \mid x_1)| + L(x_2)|Y_1 - x_1| + K(Y_1)|Y_2 - x_2|$ and $(a + b + c)^2 \le 4(a^2 + b^2 + c^2)$ for $a$, $b$, and $c$ in $\mathbb R$, then
\[
E[|R_n(X_{1:2}, U_2)|] \le K_1 \sqrt{\frac{2}{\pi}}\, \frac{\ell^3}{n^{1/2}}\left( E[K(Y_1)\, \mathbf 1_{\mathcal X_1}(Y_1)] + 4\, E\!\left[\left(\frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\right)^{\!2}\right]\right) + 4 K_1 \sqrt{\frac{2}{\pi}}\, \frac{\ell^5}{n^{3/2}}\, E[L^2(X_2)] + 4 K_1 \sqrt{\frac{32}{\pi}}\, \frac{\ell^5}{n^{3/2}}\, E[K^2(Y_1)\, \mathbf 1_{\mathcal X_1}(Y_1)].
\]
As argued in the proof of Lemma B.1, $E[K^2(Y_1)\, \mathbf 1_{\mathcal X_1}(Y_1)] < \infty$; furthermore, the other expectations on the right are finite by assumption. The three terms on the right are thus $O(n^{-1/2})$, $O(n^{-3/2})$, and $O(n^{-3/2})$, which implies that $E[|R_n(X_{1:2}, U_2)|] \to 0$ as $n \to \infty$. We now turn to the second term on the RHS of (B.8); since the acceptance function takes values in $[0, 1]$, this term is bounded by
\[
E_{X_2}\!\left[\left| n\, E_{Y_2}[h(Y_2) - h(X_2)] - \frac{\ell^2}{2} h''(X_2)\right|\right] \le \frac{n}{6}\, E_{X_2}\!\big[E_{Y_2}\big[|h'''(U_2)|\, |Y_2 - X_2|^3\big]\big]
\]
for some $U_2 \in (X_2, Y_2)$ or $U_2 \in (Y_2, X_2)$. The term on the right arises from a third-order Taylor expansion of $h$ with respect to $Y_2$ around $X_2$, along with the fact that $Y_2 \sim N(X_2, \ell^2/n)$. Since $|h'''|$ is bounded by a constant, the previous expression is bounded by $K_2 \ell^3/\sqrt n$ for some $K_2 > 0$, which converges to 0 as $n \to \infty$. In a similar fashion, by Taylor expanding $h$ to second order and using the fact that the functions $|h''|$ and $g$ are bounded by $K_3 > 0$ and 1 respectively, the third term on the RHS of (B.8) satisfies

\[
E_{X_{1:2}}\!\Big[\Big| n\, E_{X_{3:n}, Y_{1,3:n}}\!\Big[ E_{Y_2}\big[(h(Y_2) - h(X_2))(Y_2 - X_2)\big]\, \frac{\partial}{\partial X_2}\log f(X_2 \mid Y_1)\, g(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\Big] - \ell^2 h'(X_2)\, \frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\, E_{X_{3:n}, Y_{1,3:n}}\big[g(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\, \mathbf 1_{\mathcal X_1}(Y_1)\big]\Big|\Big]
\]
\[
\le \ell^2\, E\!\left[|h'(X_2)| \left|\frac{\partial}{\partial X_2}\log f(X_2 \mid Y_1) - \frac{\partial}{\partial X_2}\log f(X_2 \mid X_1)\right| \mathbf 1_{\mathcal X_1}(Y_1)\right] + \frac{K_3}{\sqrt{2\pi}}\, \frac{\ell^3}{n^{1/2}}\, E\!\left[\left|\frac{\partial}{\partial X_2}\log f(X_2 \mid Y_1)\right| \mathbf 1_{\mathcal X_1}(Y_1)\right].
\]
From the Lipschitz continuity of $\frac{\partial}{\partial x_2}\log f(x_2 \mid x_1)$ and the fact that $h'$ is bounded in absolute value, the first term on the right of the inequality is bounded by $\ell^2 K_4\, E[L(X_2)|Y_1 - X_1|] \le \ell^3 \sqrt 2\, K_4\, E[L(X_2)]/\sqrt{\pi n}$ for some $K_4 > 0$; it is thus $O(n^{-1/2})$. The second term is also $O(n^{-1/2})$ since $E[|\frac{\partial}{\partial X_2}\log f(X_2 \mid Y_1)|\, \mathbf 1_{\mathcal X_1}(Y_1)] < \infty$ (from the proof of Proposition C.1).

Lemma B.4. As $n \to \infty$, we have
\[
E_{X_{1:n}, Y_{1,3:n}}\Big[\big|\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)}) - \hat\alpha(\mathbf X^{(n)}, \mathbf Y_{X_2}^{(n)})\big|\, \mathbf 1_{\mathcal X_1}(Y_1)\Big] \to 0\,,
\]
where $\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = 1 \wedge \exp\{\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)})\}$ with $\varepsilon$ as in (B.5) and $\hat\alpha(\mathbf x^{(n)}, \mathbf Y^{(n)}) = 1 \wedge \exp\{\hat\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)})\}$ with
\[
\hat\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \log\frac{f_1(Y_1)}{f_1(x_1)} + \log\frac{f(Y_2 \mid Y_1)}{f(x_2 \mid x_1)} + \sum_{i=3}^{n}\frac{\partial}{\partial x_1}\log f(x_i \mid x_1)(Y_1 - x_1) \qquad (B.9)
\]
\[
\quad + \frac{1}{2}\sum_{i=3}^{n}\frac{\partial^2}{\partial x_1^2}\log f(x_i \mid x_1)(Y_1 - x_1)^2 + \sum_{i=3}^{n}\frac{\partial}{\partial x_i}\log f(x_i \mid x_1)(Y_i - x_i) - \frac{\ell^2}{2n}\sum_{i=3}^{n}\left(\frac{\partial}{\partial x_i}\log f(x_i \mid x_1)\right)^{\!2}.
\]

Proof. The function $\varepsilon$ in (B.5) is reexpressed as
\[
\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \log\frac{f_1(Y_1)}{f_1(x_1)} + \log\frac{f(Y_2 \mid Y_1)}{f(x_2 \mid x_1)} + \sum_{i=3}^{n}\big(\log f(Y_i \mid Y_1) - \log f(Y_i \mid x_1)\big) + \sum_{i=3}^{n}\big(\log f(Y_i \mid x_1) - \log f(x_i \mid x_1)\big).
\]

Using second-order Taylor expansions with respect to $Y_i$ around $x_i$ ($i = 3, \ldots, n$) to reexpress the last two terms on the right hand side leads to
\[
\varepsilon(\mathbf x^{(n)}, \mathbf Y^{(n)}) = \log\frac{f_1(Y_1)}{f_1(x_1)} + \log\frac{f(Y_2 \mid Y_1)}{f(x_2 \mid x_1)} + \sum_{i=3}^{n}\big(\log f(x_i \mid Y_1) - \log f(x_i \mid x_1)\big)
\]
\[
\quad + \sum_{i=3}^{n}\left(\frac{\partial}{\partial x_i}\log f(x_i \mid Y_1) - \frac{\partial}{\partial x_i}\log f(x_i \mid x_1)\right)(Y_i - x_i) + \frac{1}{2}\sum_{i=3}^{n}\left(\frac{\partial^2}{\partial U_i^2}\log f(U_i \mid Y_1) - \frac{\partial^2}{\partial U_i^2}\log f(U_i \mid x_1)\right)(Y_i - x_i)^2
\]
\[
\quad + \sum_{i=3}^{n}\frac{\partial}{\partial x_i}\log f(x_i \mid x_1)(Y_i - x_i) + \frac{1}{2}\sum_{i=3}^{n}\frac{\partial^2}{\partial x_i^2}\log f(x_i \mid x_1)(Y_i - x_i)^2 + \frac{1}{2}\sum_{i=3}^{n}\left(\frac{\partial^2}{\partial V_i^2}\log f(V_i \mid x_1) - \frac{\partial^2}{\partial x_i^2}\log f(x_i \mid x_1)\right)(Y_i - x_i)^2
\]
for some $U_i, V_i \in (x_i, Y_i)$ or $U_i, V_i \in (Y_i, x_i)$. Furthermore, by Taylor expanding the third term of the previous expression to second order (with respect to $Y_1$ around $x_1$), we obtain
\[
\sum_{i=3}^{n}\big(\log f(x_i \mid Y_1) - \log f(x_i \mid x_1)\big) = \sum_{i=3}^{n}\frac{\partial}{\partial x_1}\log f(x_i \mid x_1)(Y_1 - x_1) + \frac{1}{2}\sum_{i=3}^{n}\frac{\partial^2}{\partial x_1^2}\log f(x_i \mid x_1)(Y_1 - x_1)^2 + \frac{1}{2}\sum_{i=3}^{n}\left(\frac{\partial^2}{\partial U_1^2}\log f(x_i \mid U_1) - \frac{\partial^2}{\partial x_1^2}\log f(x_i \mid x_1)\right)(Y_1 - x_1)^2
\]

for some $U_1 \in (x_1, Y_1)$ or $U_1 \in (Y_1, x_1)$. Using the Lipschitz property of $1\wedge\exp\{\cdot\}$ yields

$$\mathbb E\Big[\big|1\wedge\exp\{\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\}-1\wedge\exp\{\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\}\big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$\le \frac{1}{2}\,\mathbb E\Big[\Big|\sum_{i=3}^n\Big(\frac{\partial^2}{\partial U_1^2}\log f(X_i|U_1)-\frac{\partial^2}{\partial X_1^2}\log f(X_i|X_1)\Big)(Y_1-X_1)^2\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$+ \mathbb E\Big[\Big|\sum_{i=3}^n\Big(\frac{\partial}{\partial X_i}\log f(X_i|Y_1)-\frac{\partial}{\partial X_i}\log f(X_i|X_1)\Big)(Y_i-X_i)\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$+ \frac{1}{2}\,\mathbb E\Big[\Big|\sum_{i=3}^n\Big(\frac{\partial^2}{\partial U_i^2}\log f(U_i|Y_1)-\frac{\partial^2}{\partial U_i^2}\log f(U_i|X_1)\Big)(Y_i-X_i)^2\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$+ \mathbb E\Big[\Big|\frac{1}{2}\sum_{i=3}^n\frac{\partial^2}{\partial X_i^2}\log f(X_i|X_1)\,(Y_i-X_i)^2 + \frac{\ell^2}{2n}\sum_{i=3}^n\Big(\frac{\partial}{\partial X_i}\log f(X_i|X_1)\Big)^2\Big|\Big]$$
$$+ \frac{1}{2}\,\mathbb E\Big[\Big|\sum_{i=3}^n\Big(\frac{\partial^2}{\partial V_i^2}\log f(V_i|X_1)-\frac{\partial^2}{\partial X_i^2}\log f(X_i|X_1)\Big)(Y_i-X_i)^2\Big|\Big]\,. \qquad \text{(B.10)}$$

It remains to show that each term on the right-hand side converges to 0 as $n\to\infty$. We look at the first term of (B.10). Using the triangle inequality and the fact that $(Y_1-X_1)\sim N(0,\ell^2/n)$, we have

$$\frac{1}{2}\,\mathbb E\Big[\Big|\sum_{i=3}^n\Big(\frac{\partial^2}{\partial U_1^2}\log f(X_i|U_1)-\frac{\partial^2}{\partial X_1^2}\log f(X_i|X_1)\Big)(Y_1-X_1)^2\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$\le \frac{\ell^2}{2}\Big(\frac{n-2}{n}\Big)\,\mathbb E\Big[\Big|\frac{\partial^2}{\partial U_1^2}\log f(X_3|U_1)-\frac{\partial^2}{\partial X_1^2}\log f(X_3|X_1)\Big|\,Z_1^2\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]\,,$$

where $Z_1=\sqrt n\,(Y_1-x_1)/\ell\sim N(0,1)$. Since $|U_1-X_1|\le|Y_1-X_1|$ and $Y_1\in\mathcal X_1$, then $U_1\in\mathcal X_1$; in addition, $Y_1\to_{a.s.}X_1$ implies that $U_1\to_{a.s.}X_1$. By the Continuous Mapping Theorem, $\big|\frac{\partial^2}{\partial U_1^2}\log f(X_3|U_1)-\frac{\partial^2}{\partial X_1^2}\log f(X_3|X_1)\big|\,\mathbf 1_{\mathcal X_1}(Y_1)\to_{a.s.}0$. Since this term is bounded by $2M(X_3)\ge 0$ and $2\,\mathbb E[M(X_3)Z_1^2]=2\,\mathbb E[M(X_3)]<\infty$, the Dominated Convergence Theorem can be used to conclude that the first term on the right of (B.10) converges to 0 as $n\to\infty$.

We now consider the second term. Given $x_1, Y_1\in\mathcal X_1$ and $x_i\in\mathbb R$ ($i=3,\dots,n$),

$$\sum_{i=3}^n\Big(\frac{\partial}{\partial x_i}\log f(x_i|Y_1)-\frac{\partial}{\partial x_i}\log f(x_i|x_1)\Big)(Y_i-x_i) \sim N\bigg(0,\ \frac{\ell^2}{n}\sum_{i=3}^n\Big(\frac{\partial}{\partial x_i}\log f(x_i|Y_1)-\frac{\partial}{\partial x_i}\log f(x_i|x_1)\Big)^2\bigg)\,.$$

Computing the expectation of the half-normal distribution, applying Jensen's inequality (for the square root function, which is concave), and then the triangle inequality lead to

$$\mathbb E_{X_{1,3:n},Y_{1,3:n}}\Big[\Big|\sum_{i=3}^n\Big(\frac{\partial}{\partial X_i}\log f(X_i|Y_1)-\frac{\partial}{\partial X_i}\log f(X_i|X_1)\Big)(Y_i-X_i)\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$\le \sqrt{\frac{2\ell^2}{\pi}\Big(\frac{n-2}{n}\Big)}\ \mathbb E_{X_{1,3},Y_1}\Big[\Big(\frac{\partial}{\partial X_3}\log f(X_3|Y_1)-\frac{\partial}{\partial X_3}\log f(X_3|X_1)\Big)^2\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]^{1/2}.$$

Since $Y_1\to_{a.s.}X_1$, then $\big(\frac{\partial}{\partial X}\log f(X|Y_1)-\frac{\partial}{\partial X}\log f(X|X_1)\big)^2\to_{a.s.}0$ by the Continuous Mapping Theorem. Furthermore, we know that

$$\mathbb E\Big[\Big(\frac{\partial}{\partial X}\log f(X|Y_1)-\frac{\partial}{\partial X}\log f(X|X_1)\Big)^4\,\mathbf 1_{\mathcal X_1}(Y_1)\Big] \le \mathbb E\big[L^4(X)\,(Y_1-x_1)^4\big] = \frac{3\ell^4}{n^2}\,\mathbb E[L^4(X)] < \infty\,;$$

the Uniform Integrability Theorem can then be used to conclude that the second term on the right-hand side of (B.10) converges to 0 as $n\to\infty$.

Using the triangle inequality and the fact that $(Y_i-X_i)\sim N(0,\ell^2/n)$ ($i=3,\dots,n$), the third term on the right-hand side of (B.10) is bounded by

$$\frac{\ell^2}{2}\Big(\frac{n-2}{n}\Big)\,\mathbb E\Big[\Big|\frac{\partial^2}{\partial U_3^2}\log f(U_3|Y_1)-\frac{\partial^2}{\partial U_3^2}\log f(U_3|X_1)\Big|\,Z_3^2\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]\,,$$

where $Z_3=\sqrt n\,(Y_3-X_3)/\ell\sim N(0,1)$. Given that $Y_1\to_{a.s.}X_1$, the Continuous Mapping Theorem implies that $\frac{\partial^2}{\partial U_3^2}\log f(U_3|Y_1)-\frac{\partial^2}{\partial U_3^2}\log f(U_3|X_1)\to_{a.s.}0$ as $n\to\infty$. We again invoke the Uniform Integrability Theorem to conclude that the third term on the right converges to 0 as $n\to\infty$, since

$$\mathbb E\Big[\Big(\frac{\partial^2}{\partial U_3^2}\log f(U_3|Y_1)-\frac{\partial^2}{\partial U_3^2}\log f(U_3|X_1)\Big)^2 Z_3^4\,\mathbf 1_{\mathcal X_1}(Y_1)\Big] \le 6\,\mathbb E\big[K^2(Y_1)\,\mathbf 1_{\mathcal X_1}(Y_1)\big] + 6\,\mathbb E\big[K^2(X_1)\big] < \infty\,.$$

Replacing $Y_1$ by $X_1$ in the proof of Proposition C.1, the fourth term on the right of (B.10) is easily seen to converge to 0 as $n\to\infty$. Finally, the last term is bounded by

$$\frac{\ell^2}{2}\Big(\frac{n-2}{n}\Big)\,\mathbb E\Big[\Big|\frac{\partial^2}{\partial V_3^2}\log f(V_3|X_1)-\frac{\partial^2}{\partial X_3^2}\log f(X_3|X_1)\Big|\,Z_3^2\Big]\,,$$

with $Z_3=\sqrt n\,(Y_3-X_3)/\ell$. Given that $Y_3\to_{a.s.}X_3$ and $|V_3-X_3|\le|Y_3-X_3|$, we have $V_3\to_{a.s.}X_3$, and the Continuous Mapping Theorem implies that the integrand converges to 0 almost surely. Furthermore, the integrand is bounded by $2K(X_1)Z_3^2$, and since $2\,\mathbb E[K(X_1)Z_3^2]=2\,\mathbb E[K(X_1)]<\infty$, the Dominated Convergence Theorem is used to conclude the proof.

Lemma B.5. As $n\to\infty$, we have

$$\Big|\mathbb E_{X_{1:2}}\Big[\ell^2\,\mathbb E_{X_{3:n},Y_{1,3:n}}\big[\hat\alpha(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - \upsilon(\ell,X_1)\Big]\Big| \to 0\,,$$

where $\hat\alpha(\mathbf x^{(n)},\mathbf Y^{(n)}) = 1\wedge\exp\{\hat\varepsilon(\mathbf x^{(n)},\mathbf Y^{(n)})\}$ with the function $\hat\varepsilon$ as in (B.9) and the function $\upsilon$ as in (6).

Proof. We have from the triangle inequality

$$\Big|\mathbb E_{X_{1:2}}\Big[\ell^2\,\mathbb E_{X_{3:n},Y_{1,3:n}}\big[\hat\alpha(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - \upsilon(\ell,X_1)\Big]\Big|$$
$$\le \ell^2\,\mathbb E_{X_{1:2},Z_1}\Big[\Big|\mathbb E_{X_{3:n},Y_{3:n}}\big[1\wedge\exp\{\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\}\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - 2\Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(X_1,Z_1)\Big)\Big|\Big]\,,$$

where $Z_1=\sqrt n\,(Y_1-x_1)/\ell\sim N(0,1)$. From the boundedness of the absolute value in the above expression, it is sufficient to show that, conditionally on $x_1\in\mathcal X_1$ and $x_2, Z_1\in\mathbb R$,

$$\mathbb E_{X_{3:n},Y_{3:n}}\big[1\wedge\exp\{\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\}\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - 2\Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(X_1,Z_1)\Big) \to 0 \quad\text{as } n\to\infty\,.$$

Since the function $\hat\varepsilon$ is evaluated at $Y_2=x_2$, it may be reexpressed as

$$\hat\varepsilon\big(\mathbf x^{(n)},(x_1+\tfrac{\ell}{\sqrt n}Z_1,\,x_2,\,\mathbf Y_{3:n})\big) = \log\frac{f_1(x_1+\frac{\ell}{\sqrt n}Z_1)}{f_1(x_1)} + \log\frac{f(x_2|x_1+\frac{\ell}{\sqrt n}Z_1)}{f(x_2|x_1)} \qquad \text{(B.11)}$$
$$+ \frac{\ell}{\sqrt n}\sum_{i=3}^n\frac{\partial}{\partial x_1}\log f(x_i|x_1)\,Z_1 + \frac{\ell^2}{2n}\sum_{i=3}^n\frac{\partial^2}{\partial x_1^2}\log f(x_i|x_1)\,Z_1^2$$
$$+ \sum_{i=3}^n\frac{\partial}{\partial x_i}\log f(x_i|x_1)\,(Y_i-x_i) - \frac{\ell^2}{2n}\sum_{i=3}^n\Big(\frac{\partial}{\partial x_i}\log f(x_i|x_1)\Big)^2.$$

In the sequel, we thus condition on $x_1\in\mathcal X_1$ and $x_2, Z_1\in\mathbb R$, and study the convergence of every term in (B.11) as $n\to\infty$. Given any $x_1\in\mathcal X_1$ and $Z_1\in\mathbb R$, there exists $n^*\ge 1$ such that $x_1+\frac{\ell}{\sqrt n}Z_1\in\mathcal X_1$ for all $n\ge n^*$; it therefore follows from the continuity of the functions involved that $\log\{f_1(Y_1)/f_1(x_1)\}\to 0$ and $\log\{f(x_2|Y_1)/f(x_2|x_1)\}\to 0$ for any given $x_2\in\mathbb R$. We now show that, conditionally on $x_1, Z_1$, the remaining terms are asymptotically distributed according to a normal random variable.

Given any $x_1\in\mathcal X_1$ and $Z_1\in\mathbb R$, applying the Central Limit Theorem to the third term of (B.11) yields

$$\frac{\ell}{\sqrt n}\,Z_1\sum_{i=3}^n\frac{\partial}{\partial x_1}\log f(X_i|x_1) \to_d N\bigg(0,\ \ell^2 Z_1^2\,\mathbb E_X\Big[\Big(\frac{\partial}{\partial x_1}\log f(X|x_1)\Big)^2\Big]\bigg)\,.$$

This follows from the regularity assumptions in Section 2, which imply that $\frac{\partial}{\partial x_1}f(x|x_1)$ is locally integrable, so that we can differentiate outside of the integral sign to obtain

$$\mathbb E_X\Big[\frac{\partial}{\partial x_1}\log f(X|x_1)\Big] = \frac{d}{dx_1}\int_{\mathbb R} f(x|x_1)\,dx = 0\,.$$

To study the fourth term, we condition on $x_1\in\mathcal X_1$ and $Z_1\in\mathbb R$ and use the SLLN to get

$$\frac{\ell^2}{2n}\,Z_1^2\sum_{i=3}^n\frac{\partial^2}{\partial x_1^2}\log f(X_i|x_1) \to_{a.s.} \frac{\ell^2}{2}\,Z_1^2\,\mathbb E_X\Big[\frac{\partial^2}{\partial x_1^2}\log f(X|x_1)\Big]\,.$$

Again from the regularity assumptions, $\frac{\partial^2}{\partial x_1^2}f(x|x_1)$ is locally integrable and thus the following identity holds:

$$\mathbb E_X\Big[\frac{\partial^2}{\partial x_1^2}\log f(X|x_1)\Big] + \mathbb E_X\Big[\Big(\frac{\partial}{\partial x_1}\log f(X|x_1)\Big)^2\Big] = \frac{d^2}{dx_1^2}\int_{\mathbb R} f(x|x_1)\,dx = 0\,;$$

therefore, $\mathbb E_X\big[\frac{\partial^2}{\partial x_1^2}\log f(X|x_1)\big] = -\mathbb E_X\big[\big(\frac{\partial}{\partial x_1}\log f(X|x_1)\big)^2\big]$ for all $x_1\in\mathcal X_1$. Combining the previous developments and making use of Slutsky's Theorem allows us to conclude that, given any $x_1\in\mathcal X_1$ and $Z_1\in\mathbb R$,

$$\frac{\ell}{\sqrt n}\,Z_1\sum_{i=3}^n\frac{\partial}{\partial x_1}\log f(X_i|x_1) + \frac{\ell^2}{2n}\,Z_1^2\sum_{i=3}^n\frac{\partial^2}{\partial x_1^2}\log f(X_i|x_1)$$
$$\to_d N\bigg(-\frac{\ell^2}{2}\,Z_1^2\,\mathbb E_X\Big[\Big(\frac{\partial}{\partial x_1}\log f(X|x_1)\Big)^2\Big],\ \ell^2 Z_1^2\,\mathbb E_X\Big[\Big(\frac{\partial}{\partial x_1}\log f(X|x_1)\Big)^2\Big]\bigg)\,.$$
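The relation $\mathbb E_X\big[\frac{\partial^2}{\partial x_1^2}\log f(X|x_1)\big] = -\mathbb E_X\big[\big(\frac{\partial}{\partial x_1}\log f(X|x_1)\big)^2\big]$ used in this step is the familiar Fisher-information identity; as a quick numerical sanity check (not part of the proof), it can be verified by Monte Carlo for an illustrative conditional density of our choosing, $f(x|x_1)=N(x_1,1)$, which is not the paper's general model:

```python
import numpy as np

# Sanity check of E_X[d^2/dx1^2 log f(X|x1)] = -E_X[(d/dx1 log f(X|x1))^2]
# for the illustrative choice f(x|x1) = N(x1, 1) (our assumption).
# Here d/dx1 log f(x|x1) = x - x1 and d^2/dx1^2 log f(x|x1) = -1.
rng = np.random.default_rng(0)
x1 = 0.7
X = rng.normal(loc=x1, scale=1.0, size=200_000)

score = X - x1                 # d/dx1 log f(X|x1)
hessian = -np.ones_like(X)     # d^2/dx1^2 log f(X|x1)

lhs = hessian.mean()           # both sides should be close to -1
rhs = -np.mean(score**2)
print(lhs, rhs)
```

For this Gaussian example the identity holds exactly, so the two Monte Carlo averages agree up to sampling error.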

Now, given any $x_1\in\mathcal X_1$, the last two terms on the right of (B.11) satisfy

$$\frac{\ell}{\sqrt n}\sum_{i=3}^n\frac{\partial}{\partial X_i}\log f(X_i|x_1)\,Z_i - \frac{\ell^2}{2n}\sum_{i=3}^n\Big(\frac{\partial}{\partial X_i}\log f(X_i|x_1)\Big)^2$$
$$\to_d N\bigg(-\frac{\ell^2}{2}\,\mathbb E_X\Big[\Big(\frac{\partial}{\partial X}\log f(X|x_1)\Big)^2\Big],\ \ell^2\,\mathbb E_X\Big[\Big(\frac{\partial}{\partial X}\log f(X|x_1)\Big)^2\Big]\bigg)\,;$$

this follows from the WLLN and the fact that $Z_i\sim N(0,1)$ independently for $i=3,\dots,n$: conditionally on $x_1$ and $\mathbf X_{3:n}$, the expression is exactly normal, with mean and variance converging in probability to the stated limits.

Given $x_1\in\mathcal X_1$ and $Z_1\in\mathbb R$, the two normal random variables just introduced are asymptotically independent (this is easily seen from the fact that $\sqrt n\,(\mathbf Y_{3:n}-\mathbf x_{3:n})/\ell$ is independent of $\mathbf x_{3:n}$ for all $n\ge 3$). We therefore conclude that, given any $X_1\in\mathcal X_1$ and $Z_1\in\mathbb R$, $\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\to_d \eta(X_1,Z_1)$, where $\eta(x_1,Z_1)\sim N\big(-\ell^2\gamma(x_1,Z_1)/2,\ \ell^2\gamma(x_1,Z_1)\big)$, with

$$\gamma(x_1,Z_1) = Z_1^2\,\mathbb E_X\Big[\Big(\frac{\partial}{\partial x_1}\log f(X|x_1)\Big)^2\Big] + \mathbb E_X\Big[\Big(\frac{\partial}{\partial X}\log f(X|x_1)\Big)^2\Big]\,.$$

It easily follows from the fact that $\mathbf 1_{\mathcal X_1}(x_1+\frac{\ell}{\sqrt n}Z_1)\to 1$ given any $x_1\in\mathcal X_1$ and $Z_1\in\mathbb R$, Slutsky's Theorem, and the Continuous Mapping Theorem that $1\wedge\exp\{\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\}\,\mathbf 1_{\mathcal X_1}(Y_1)\to_d 1\wedge\exp\{\eta(X_1,Z_1)\}$. From Proposition 2.4 in [15], we know that given $x_1, Z_1$,

$$\mathbb E_\eta\big[1\wedge\exp\{\eta(x_1,Z_1)\}\big] = 2\Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(x_1,Z_1)\Big)\,.$$

From the convergence in distribution and the boundedness (and thus uniform integrability) of the random variables, the means are known to converge, i.e. given any $x_1\in\mathcal X_1$ and $x_2, Z_1\in\mathbb R$,

$$\mathbb E_{X_{3:n},Y_{3:n}}\big[1\wedge\exp\{\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\}\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - 2\Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(X_1,Z_1)\Big) \to 0\,,$$

which concludes the proof.
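The closing identity $\mathbb E_\eta[1\wedge e^{\eta}] = 2\Phi(-\frac{\ell}{2}\gamma^{1/2})$ for $\eta\sim N(-\ell^2\gamma/2,\ \ell^2\gamma)$ is easy to confirm numerically; the sketch below uses arbitrary illustrative values of $\ell$ and $\gamma$ (they are not taken from the paper):

```python
import numpy as np
from math import erfc, sqrt

# Monte Carlo check of E[1 ∧ exp(η)] = 2Φ(-(ℓ/2)√γ)
# for η ~ N(-ℓ²γ/2, ℓ²γ); ℓ and γ are arbitrary illustrative values.
rng = np.random.default_rng(1)
ell, gamma = 2.38, 1.5
eta = rng.normal(-ell**2 * gamma / 2, ell * sqrt(gamma), size=1_000_000)

mc = np.minimum(1.0, np.exp(eta)).mean()
exact = erfc(ell * sqrt(gamma) / 2 / sqrt(2))  # 2Φ(-a) = erfc(a/√2)
print(mc, exact)
```

The exact value uses only the standard library, via the relation $2\Phi(-a)=\operatorname{erfc}(a/\sqrt 2)$.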

Lemma B.6. As $n\to\infty$, we have

$$\Big|\mathbb E_{X_{1:2}}\Big[\Big|\frac{\partial}{\partial X_2}\log f(X_2|X_1)\Big|\Big(\ell^2\,\mathbb E_{X_{3:n},Y_{1,3:n}}\big[g(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - \frac{\upsilon(\ell,X_1)}{2}\Big)\Big]\Big| \to 0\,,$$

where the function $g$ is as in (B.4).

Proof. Making use of the triangle inequality, we may bound the expectation in the statement of the lemma by

$$\ell^2\,\mathbb E_{X_{1:2},Z_1}\Big[\Big|\frac{\partial}{\partial X_2}\log f(X_2|X_1)\Big|\,\Big|\mathbb E_{X_{3:n},Y_{3:n}}\big[g(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - \Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(X_1,Z_1)\Big)\Big|\Big]\,,$$

which is itself bounded by $2\ell^2\,\mathbb E\big[\big|\frac{\partial}{\partial X_2}\log f(X_2|X_1)\big|\big]<\infty$, since each term in the difference is bounded by 1 in absolute value. We can thus use the Dominated Convergence Theorem to bring the limit inside the first expectation. To conclude the proof, it remains to verify that, given any $X_1\in\mathcal X_1$ and $X_2, Z_1\in\mathbb R$,

$$\mathbb E_{X_{3:n},Y_{3:n}}\big[g(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\big] - \Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(X_1,Z_1)\Big) \to 0\,,$$

where $Z_1=\sqrt n\,(Y_1-X_1)/\ell$.

In the proof of Lemma B.4 we have verified, among other things, that $\mathbb E\big[\big|\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})-\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\big|\,\mathbf 1_{\mathcal X_1}(Y_1)\big]\to 0$ as $n\to\infty$. This $L^1$-convergence thus entails that $\big|\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})-\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\big|\,\mathbf 1_{\mathcal X_1}(Y_1)\to_p 0$. From the proof of Lemma B.5 we know that, given any $X_1\in\mathcal X_1$ and $X_2, Z_1\in\mathbb R$, $\hat\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\to_d \eta(X_1,Z_1)$. Using Slutsky's Theorem, these convergences imply that, conditionally on $X_1\in\mathcal X_1$ and $X_2, Z_1\in\mathbb R$, $\varepsilon(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1)\to_d \eta(X_1,Z_1)$.

From the Continuous Mapping Theorem, we deduce that given any $X_1\in\mathcal X_1$ and $Z_1\in\mathbb R$,

$$g(\mathbf X^{(n)},\mathbf Y^{(n)}_{X_2})\,\mathbf 1_{\mathcal X_1}(Y_1) \to_d \exp\{\eta(X_1,Z_1)\}\,\mathbf 1\{\exp\{\eta(X_1,Z_1)\}<1\}\,.$$

The function under study is obviously not continuous; however, the discontinuities of the function on the right have null Lebesgue measure, and thus the Continuous Mapping Theorem is applicable as stated in [7] (Theorem 5.1 and its corollaries).

By examining the proof of Proposition 2.4 in [15] we obtain, conditionally on $X_1\in\mathcal X_1$ and $Z_1\in\mathbb R$,

$$\mathbb E_\eta\big[\exp\{\eta(X_1,Z_1)\}\,\mathbf 1\{\exp\{\eta(X_1,Z_1)\}<1\}\big] = \Phi\Big(-\frac{\ell}{2}\gamma^{1/2}(X_1,Z_1)\Big)\,.$$

From the convergence in distribution and the fact that the random variables under consideration are bounded (and thus uniformly integrable), the means are known to converge; this concludes the proof of the lemma.
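Analogously to the check for Lemma B.5, the identity $\mathbb E_\eta\big[e^{\eta}\,\mathbf 1\{e^{\eta}<1\}\big]=\Phi(-\frac{\ell}{2}\gamma^{1/2})$ can be confirmed by simulation; $\ell$ and $\gamma$ below are again arbitrary illustrative values rather than quantities from the paper:

```python
import numpy as np
from math import erfc, sqrt

# Monte Carlo check of E[exp(η) 1{exp(η) < 1}] = Φ(-(ℓ/2)√γ)
# for η ~ N(-ℓ²γ/2, ℓ²γ); ℓ and γ are arbitrary illustrative values.
rng = np.random.default_rng(3)
ell, gamma = 1.0, 2.0
eta = rng.normal(-ell**2 * gamma / 2, ell * sqrt(gamma), size=1_000_000)

w = np.exp(eta)
mc = np.where(w < 1.0, w, 0.0).mean()            # exp(η) on {η < 0}, else 0
exact = 0.5 * erfc(ell * sqrt(gamma) / 2 / sqrt(2))  # Φ(-ℓ√γ/2)
print(mc, exact)
```

This is the "lower half" of the previous identity: together with the symmetric contribution of the event $\{e^{\eta}\ge 1\}$, it recovers $\mathbb E[1\wedge e^{\eta}] = 2\Phi(-\frac{\ell}{2}\gamma^{1/2})$.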

C Appendix

Proposition C.1. Define

$$W(\mathbf X^{(n)},\mathbf Y^{(n)}) = \frac{1}{2}\sum_{i=2}^n\frac{\partial^2}{\partial X_i^2}\log f(X_i|Y_1)\,(Y_i-X_i)^2 + \frac{\ell^2}{2n}\sum_{i=2}^n\Big(\frac{\partial}{\partial X_i}\log f(X_i|Y_1)\Big)^2\,;$$

then, $\mathbb E\big[|W(\mathbf X^{(n)},\mathbf Y^{(n)})|\,\mathbf 1_{\mathcal X_1}(Y_1)\big]\to 0$ as $n\to\infty$.

Proof. By Jensen's inequality, $\mathbb E[|W|]\le\sqrt{\mathbb E[W^2]}$. Developing the square and taking the expectation with respect to $\mathbf Y_{2:n}$ yield

$$\mathbb E_{Y_{2:n}}\big[W^2(\mathbf X^{(n)},\mathbf Y^{(n)})\big] = \frac{\ell^4}{2n^2}\sum_{i=2}^n\Big(\frac{\partial^2}{\partial X_i^2}\log f(X_i|Y_1)\Big)^2 + \frac{\ell^4}{4n^2}\bigg\{\sum_{i=2}^n\Big(\frac{\partial^2}{\partial X_i^2}\log f(X_i|Y_1)+\Big(\frac{\partial}{\partial X_i}\log f(X_i|Y_1)\Big)^2\Big)\bigg\}^2,$$

which implies

$$\mathbb E_{Y_{2:n}}\big[|W(\mathbf X^{(n)},\mathbf Y^{(n)})|\big] \le \frac{\ell^2}{\sqrt{2n}}\bigg(\frac{1}{n}\sum_{i=2}^n\Big(\frac{\partial^2}{\partial X_i^2}\log f(X_i|Y_1)\Big)^2\bigg)^{1/2} + \frac{\ell^2}{2}\,\bigg|\frac{1}{n}\sum_{i=2}^n\Big(\frac{\partial^2}{\partial X_i^2}\log f(X_i|Y_1)+\Big(\frac{\partial}{\partial X_i}\log f(X_i|Y_1)\Big)^2\Big)\bigg|\,.$$

Reapplying Jensen's inequality on the first term and developing the second term lead to

$$\mathbb E\big[|W(\mathbf X^{(n)},\mathbf Y^{(n)})|\,\mathbf 1_{\mathcal X_1}(Y_1)\big] \le \frac{\ell^2}{\sqrt{2n}}\,\mathbb E\Big[\Big(\frac{\partial^2}{\partial X^2}\log f(X|Y_1)\Big)^2\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]^{1/2}$$
$$+ \frac{\ell^2}{2}\,\mathbb E\bigg[\bigg|\frac{1}{n}\sum_{i=2}^n\Big(\frac{\partial^2}{\partial X_i^2}\log f(X_i|X_1)+\Big(\frac{\partial}{\partial X_i}\log f(X_i|X_1)\Big)^2\Big)\bigg|\bigg]$$
$$+ \frac{\ell^2}{2}\Big(\frac{n-1}{n}\Big)\,\mathbb E\Big[\Big|\frac{\partial^2}{\partial X_i^2}\log f(X_i|Y_1)-\frac{\partial^2}{\partial X_i^2}\log f(X_i|X_1)\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]$$
$$+ \frac{\ell^2}{2}\Big(\frac{n-1}{n}\Big)\,\mathbb E\Big[\Big|\Big(\frac{\partial}{\partial X_i}\log f(X_i|Y_1)\Big)^2-\Big(\frac{\partial}{\partial X_i}\log f(X_i|X_1)\Big)^2\Big|\,\mathbf 1_{\mathcal X_1}(Y_1)\Big]\,.$$

The first term on the right is bounded by $\ell^2\,\mathbb E\big[K^2(Y_1)\,\mathbf 1_{\mathcal X_1}(Y_1)\big]^{1/2}/(2n)^{1/2}$, which converges to 0 as $n\to\infty$ from the argument at the end of the proof of Lemma B.1. From Lemma 12 in [2], we know that $\frac{\partial}{\partial x}\log f(x|x_1)\to 0$ as $x\to\pm\infty$, $\forall x_1\in\mathcal X_1$; hence, given $x_1$, we have

$$\mathbb E_X\Big[\frac{\partial^2}{\partial X^2}\log f(X|x_1)+\Big(\frac{\partial}{\partial X}\log f(X|x_1)\Big)^2\Big] = \int_{\mathbb R}\frac{\partial^2}{\partial x^2}f(x|x_1)\,dx = 0$$

and, by the WLLN,

$$\frac{1}{n}\sum_{i=2}^n\Big(\frac{\partial^2}{\partial X_i^2}\log f(X_i|X_1)+\Big(\frac{\partial}{\partial X_i}\log f(X_i|X_1)\Big)^2\Big) \to_p 0\,.$$

To invoke the Uniform Integrability Theorem for the second term, we use the finiteness of $\mathbb E\big[\big(\frac{\partial^2}{\partial X^2}\log f(X|X_1)\big)^2\big]\le\mathbb E[K^2(X_1)]$ and of $\mathbb E\big[\big(\frac{\partial}{\partial X}\log f(X|X_1)\big)^4\big]$.

From $Y_1\to_{a.s.}x_1$ and the Continuous Mapping Theorem, the integrands of the last two terms are seen to converge to 0 almost surely. Since $\mathbb E\big[K^2(Y_1)\,\mathbf 1_{\mathcal X_1}(Y_1)\big]<\infty$ (Section A.2) and $\mathbb E[K^2(X_1)]<\infty$, the third term converges to 0 using the Uniform Integrability Theorem. We come to the same conclusion for the fourth term, using $\mathbb E\big[\big(\frac{\partial}{\partial X}\log f(X|X_1)\big)^4\big]<\infty$ and

$$\mathbb E\Big[\Big|\frac{\partial}{\partial X}\log f(X|Y_1)\Big|^4\,\mathbf 1_{\mathcal X_1}(Y_1)\Big] \le 8\,\mathbb E\Big[\Big|\frac{\partial}{\partial X}\log f(X|Y_1)-\frac{\partial}{\partial X}\log f(X|X_1)\Big|^4\,\mathbf 1_{\mathcal X_1}(Y_1)\Big] + 8\,\mathbb E\Big[\Big|\frac{\partial}{\partial X}\log f(X|X_1)\Big|^4\Big]$$
$$\le 24\,\frac{\ell^4}{n^2}\,\mathbb E[L^4(X)] + 8\,\mathbb E\Big[\Big|\frac{\partial}{\partial X}\log f(X|X_1)\Big|^4\Big] < \infty\,.$$
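The statement of Proposition C.1 can also be illustrated numerically: $\mathbb E[|W|]$ may be approximated by simulation for a hierarchical model of our choosing, $f(x|x_1)=N(x_1,1)$ with $X_1\sim N(0,1)$ (an illustrative assumption, not the paper's general setting), for which $\frac{\partial^2}{\partial x^2}\log f(x|y_1)=-1$ and $\frac{\partial}{\partial x}\log f(x|y_1)=-(x-y_1)$:

```python
import numpy as np

# Monte Carlo illustration of Proposition C.1, E[|W|] -> 0, for the
# illustrative model f(x|x1) = N(x1, 1), X1 ~ N(0, 1) (our assumption).
# Then d^2/dx^2 log f(x|y1) = -1 and d/dx log f(x|y1) = -(x - y1).
rng = np.random.default_rng(2)
ell = 2.0  # illustrative proposal scaling

def mean_abs_W(n, reps=2000):
    vals = np.empty(reps)
    for r in range(reps):
        X1 = rng.normal()                      # top-level component
        Xi = rng.normal(X1, 1.0, size=n - 1)   # components i = 2, ..., n
        Y1 = X1 + rng.normal(0.0, ell / np.sqrt(n))
        Yi = Xi + rng.normal(0.0, ell / np.sqrt(n), size=n - 1)
        first = 0.5 * np.sum(-1.0 * (Yi - Xi) ** 2)
        second = ell**2 / (2 * n) * np.sum((Xi - Y1) ** 2)
        vals[r] = abs(first + second)
    return vals.mean()

small, large = mean_abs_W(100), mean_abs_W(10_000)
print(small, large)   # the estimate of E[|W|] shrinks as n grows
```

The two sums in $W$ have nearly opposite expectations, so their fluctuations, of order $n^{-1/2}$, dominate; the simulated $\mathbb E[|W|]$ visibly decreases with $n$.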
