<<

reduction for MCMC via martingale representations

D. Belomestny1,3 and E. Moulines2,3 and S. Samsonov3 1 Duisburg-Essen University, Faculty of Mathematics, D-45127 Essen Germany 2 Centre de Math´ematiquesAppliqu´ees,UMR 7641, Ecole Polytechnique, France 3 National University Higher School of Economics, Moscow, Russia

Abstract: In this paper we propose an efficient variance reduction ap- proach for additive functionals of Markov chains relying on a novel discrete time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the stationary distribution (and even any type of ergodicity) or specific structure of the underlying density. By rigorously analyzing the convergence properties of the proposed , we show that its cost-to-variance product is indeed smaller than one of the naive algorithm. The numerical performance of the new method is illus- trated for the Langevin-type Monte Carlo (MCMC) meth- ods. MSC 2010 subject classifications: Primary 60G40, 60G40; secondary 91G80. Keywords and phrases: MCMC, variance reduction, martingale repre- sentation.

1. Introduction

Markov chains and Markov Chain Monte Carlo algorithms play crucial role in modern , finding various applications in such research areas as , reinforcement learning, online learning. As an illustra- tion, suppose that we aim at computing π(f) := E [f(X)], where X is a random vector in X ⊆ Rd admitting a density π and f is a square-integrable function f : X → , f ∈ L2(π). Typically it is not possible to compute π(f) analytically, leading to the Monte Carlo methods as a popular solution. Given an independent identically distributed observations X1,...,Xn from π, we might estimate π(f) Pn 2 byπ ˆ(f) := 1/n k=1 f(Xk). The variance of such estimate equals σ (f)/n with arXiv:1903.07373v3 [stat.CO] 11 May 2020 σ2(f) being the variance of the integrand. The first way to obtain tighter esti- mateπ ˆ(f) is simply to increase the sample size n. Unfortunately, this solution might be prohibitively costly, especially when the dimension d is large enough or sampling from π is complicated. An alternative approach is to decrease σ2(f) by constructing a new Monte-Carlo experiment with the same expectation as the original one, but with a lower variance. Methods to achieve this are known as variance reduction techniques. Introduction to many of them can be found in

1 D. Belomestny et al/Variance reduction for MCMC algorithms 2

[24], [13] and [12]. Recently one witnessed a revival of interest in efficient vari- ance reduction methods for Markov chains, mostly with applications to MCMC algorithms; see for example [8], [20], [21], [25], [6] and references therein. One of the popular approaches to variance reduction is the control variates method. We aim at constructing cheaply computable ζ (control variate) with E[ζ] = 0 and E[ζ2] < ∞, such that the variance of the r.v. f(X)+ζ is small. The complexity of the problem of constructing classes of control vari- ates ζ satisfying E[ζ] = 0 essentially depends on the degree of our knowledge on π. For example, if π is analytically known and satisfies some regularity con- ditions, one can apply the well-known technique of polynomial interpolation to construct control variates enjoying some optimality properties, see, for example [9, Section 3.2]. Alternatively, if an orthonormal system in L2(π) is analytically available, one can build control variates ζ as a linear combination of the cor- responding basis functions, see [4]. Furthermore, if π is known only up to a normalizing constant (which is often the case in Bayesian ), one can apply the recent approach of control variates depending only on the gradient ∇ log π using Schr¨odinger-type Hamiltonian operator in [1], [20], and Stein op- erator in [6]. In some situations π is not known analytically, but X can be represented as a function of simple random variables with known distribution. Such situation arises, for example, in the case of functionals of discretized dif- fusion processes. In this case a Wiener chaos-type decomposition can be used to construct control variates with nice theoretical properties, see [3]. Note that in order to compare different variance reduction approaches, one has to analyze their complexity, that is, the number of numerical operations required to achieve a prescribed magnitude of the resulting variance. Unfortunately it is not always possible to generate independent observations distributed according to π. To overcome this issue one might consider MCMC algorithms, where the exact samples from π are replaced by (Xp), p = 0, 1, 2,..., forming a Markov chain with a marginal distribution of Xn converging to π in a suitable metrics as n grows. It is still possible to apply the control variates method in the similar manner to a simple Monte-Carlo case, yet the choice of the optimal control variate becomes much more difficult. Due to significant correlations between the elements of the Markov chain it might be not enough to minimize the marginal of (Xp)p≥0 as it was in independent case. Instead one may choose the control variate by minimizing the corresponding asymptotic variance of the chain as it is suggested in [2]. At the same time it is possible to express the optimal control variate in terms of the solution of the Poisson equation for the corresponding Markov chain (Xp). As it was observed in [16, 17], for a time-homogeneous Markov chain Xp with a stationary distribution π, the function UG(x) := G(x) − E[G(X1)|X0 = x] has zero mean with respect to π for an arbitrary real-valued function G : X → R, such that 1 G ∈ L (π). Hence, UG(x) is a valid control functional for a suitable choice of G, with the best G given by a solution of the Poisson equation E[G(X1)|X0 = x] − G(x) = −f(x) + π(f). For such G we obtain f(x) + UG(x) = f(x) − f(x) + π(f) = π(f) leading to an ideal with zero variance. Despite the fact that the Poisson equation involves the quantity of interest π(f) and can not D. Belomestny et al/Variance reduction for MCMC algorithms 3 be solved explicitly in most cases, this idea still can be used to construct some approximations for the optimal zero-variance control variates. For example, [16] proposed to compute approximations for the solution of the Poisson equation for specific Markov chains with particular emphasis on models arising in stochastic network theory. In [8] and [6] series-type control variates are introduced and studied for reversible Markov chains. It is assumed in [8] that the one-step conditional expectations can be computed explicitly for a set of basis functions. The authors in [6] proposed another approach tailored to diffusion setting which does not require the computation of integrals of basis functions and only involves applications of the underlying generator. In this paper we focus on the Langevin type algorithms which got much attention recently, see [7, 11, 18, 23, 22] and references therein. We propose a generic variance reduction method for these and other types algorithms, which is purely non-asymptotic and does not require that the conditional expectations of the corresponding Markov chain can be computed or that the generator is known analytically. Moreover, we do not need to assume stationarity or/and sampling under the invariant distribution π. We rigorously analyse the convergence of the method and study its complexity. It is shown that our variance-reduced Langevin algorithm outperforms the standard Langevin algorithms in terms of its cost-to-variance relation. The paper is organized as follows. In Section2 we set up the problem and introduce some notations. Section3 contains a novel martingale representation and shows how this representation can be used for variance reduction. In Sec- tion5 we analyze the performance of the proposed variance reduction algorithm in the case of Unadjusted Langevin Algorithm (ULA). Finally, numerical exam- ples are presented in Section6.

Notations. We use the notations N = {1, 2,...} and N0 = N ∪ {0}. We denote ϕ(z) = (2π)−d/2 exp(−|z|2/2), z ∈ Rd probability density function of the d−dimensional standard normal distribution. For x ∈ Rd and r > 0 let d Br(x) = {y ∈ R ||y − x| < r} where | · | is a standard Euclidean norm. For the twice differentiable function g : Rd → R we denote by D2g(x) its Hessian at point x. For m ∈ N, a smooth function h: Rd×m → R with arguments being d d denoted (y1, . . . , ym), yi ∈ R , i = 1, . . . , m, a multi-index k = (ki) ∈ N0, and j ∈ {1, . . . , m}, we use the notation ∂k h for the multiple derivative of h with yj respect to the components of yj: ∂k h(y , . . . , y ) := ∂kd . . . ∂k1 h(y , . . . , y ), y = (y1, . . . , yd). yj 1 m d 1 1 m j j j yj yj

In the particular case m = 1 we drop the subscript y1 in that notation. For d probability measures µ and ν on R denote by kµ − νkTV the total variation distance between µ and ν, that is,

kµ − νkTV = sup |µ(A) − ν(A)| . d A∈B(R ) where B(Rd) is a Borel σ-algebra of Rd. For a bounded Borel function f : Rd → R d denote osc(f) := sup d f(x) − inf d f(x). Given a function V : → , x∈R x∈R R R D. Belomestny et al/Variance reduction for MCMC algorithms 4

d for a function f : → we define kfk = sup d {|f(x)|/V (x)} and the R R V x∈R corresponding V −norm between probability measures µ and ν on B(Rd) as Z Z  dV (µ, ν) = kµ − νkV = sup f(x) dµ(x) − f(x) dν(x) . d d kfkV ≤1 R R

2. Setup

Let X be a domain in Rd. Our aim is to numerically compute expectations of the form Z π(f) = f(x)π(dx), X where f : X −→ R and π is a probability measure supported on X . If the dimension of the space X is large and π(f) can not be computed analytically, one can apply Monte Carlo methods. However, in many practical situations direct sampling from π is impossible and this precludes the use of plain Monte Carlo methods in this case. One popular alternative to Monte Carlo is Markov Chain Monte Carlo (MCMC) where one is looking for a discrete time (possibly non-homogeneous) Markov chain (Xp)p∈N0 such that π is its unique invariant measure. In this paper we study a class of MCMC algorithms with (Xp)p∈N0 satisfying the the following recurrence relation:

Xp = Φp(Xp−1, ξp), p = 1, 2,...,X0 = x (1)

m for some i.i.d. random vectors ξp ∈ R with distribution Pξ and some Borel- m measurable functions Φp : X × R → X . In fact, this is quite general class of Markov chains (see [10, Theorem 1.3.6]) and many well-known MCMC algo- rithms can be represented in the form (1). Let us consider two popular examples. Example 1 (Unadjusted Langevin Algorithm) Fix a sequence of positive d d time steps (γp)p≥1. Given a Borel function µ: R → R , consider a non-homogeneous discrete-time Markov chain (Xp)p≥0 defined by √ Xp+1 = Xp − γp+1µ(Xp) + γp+1Zp+1, (2) where (Zp)p≥1 is an i.i.d. sequence of d-dimensional standard Gaussian random vectors. If µ = ∇U for some continuously differentiable function U, then Markov chain (2) can be used to approximately sample from the density Z π(x) = Z−1e−U(x)/2,Z = e−U(x)/2 dx, (3) d R provided that Z < ∞. This method is usually referred to as Unadjusted Langevin Algorithm (ULA). Denote by W the standard Rm-valued Brownian motion. The Markov chain (2) arises as the Euler-Maruyama discretization of the Langevin diffusion dYt = −µ(Yt) dt + dWt D. Belomestny et al/Variance reduction for MCMC algorithms 5 with nonnegative time steps (γp)p≥1, and, under mild technical conditions, the latter Langevin diffusion admits π of (3) as its unique invariant distribution; see [7] and [11]. Example 2 (Metropolis-Adjusted Langevin Algorithm) The Metropolis- Hastings algorithm associated with a target density π requires the choice of a sequence of conditional densities (qp)p≥1 also called proposal or candidate ker- nels. The transition from the value of the Markov chain Xp at time p and its value at time p + 1 proceeds via the following transition step:

Given Xp = x; 1. Generate Yp ∼ qp(·|x); 2. Put ( Yp, with probability αp(x, Yp), Xp+1 = x, with probability 1 − αp(x, Yp),

where   π(y) qp(x|y) αp(x, y) = min 1, . π(x) qp(y|x)

This transition is reversible with respect to π and therefore preserves the stationary density π; see [10, Chapter 2]. If qp have a wide enough support to eventually reach any region of the state space X with positive mass under π, then this transition is irreducible and π is a maximal irreducibility measure [19]. The Metropolis-Adjusted Langevin algorithm (MALA) takes (2) as proposal, that is,

−d/2  √  qp(y|x) = (γp+1) ϕ [y − x + γp+1µ(x)]/ γp+1 .

The MALA algorithms usually provide noticeable speed-ups in convergence for most problems. It is not difficult to see that the MALA chain can be compactly represented in the form  √ Xp+1 = Xp + 1 Up+1 ≤ α(Xp,Yp) (Yp − Xp),Yp = Xp − γp+1µ(Xp) + γp+1Zp+1, where (Up)p≥1 is an i.i.d. sequence of uniformly distributed on [0, 1] random variables independent of (Zp)p≥1. Thus, we recover (1) with ξp = (Up,Zp) ∈ Rd+1 and

> √  √ Φp(x, (u, z) ) = x + 1 u ≤ α(x, x − γpµ(x) + γpz) (−γpµ(x) + γpz).

Example 3 Let (Xt)t≥0 be the unique strong solution to a SDE of the form:

dXt = b(Xt) dt + σ(Xt) dWt, t ≥ 0, (4) where b : Rd → Rd and σ : Rd × Rm → Rd are locally Lipschitz continuous functions with at most linear growth. The process (Xt)t≥0 is a Markov process D. Belomestny et al/Variance reduction for MCMC algorithms 6 and let L denote its infinitesimal generator defined by 1 Lg = b>∇g + σ>D2gσ 2 for any g ∈ C2(Rd). If there exists a continuously twice differentiable Lyapunov d function V : R → R+ such that sup LV (x) < ∞, lim sup LV (x) < 0, d x∈R |x|→∞ then there is an invariant probability measure π for X, that is, Xt ∼ π for all t > 0 if X0 ∼ π. Invariant measures are crucial in the study of the long term behaviour of stochastic differential systems (4). Under some additional assump- tions, the invariant measure π is ergodic and this property can be exploited to compute the integrals π(f) for f ∈ L2(π) by means of ergodic averages. The idea is to replace the diffusion X by a (simulable) discretization scheme of the form (see e.g. [23]) ¯ ¯ ¯ ¯ ¯ Xn+1 = Xn + γn+1b(Xn) + σ(Xn)(WΓn+1 − WΓn ), n ≥ 0, X0 = X0, where Γn = γ1 +...+γn and (γn)n≥1 is a non-increasing sequence of time steps. Then for a function f ∈ L2(π) we can approximate π(f) via

n 1 X πγ (f) = γ f(X¯ ). n Γ i i−1 n i=1

Due to typically high correlation between X0,X1,... variance reduction is of crucial importance here. As a matter of fact, in many cases there is no explicit formula for the invariant measure and this makes the use of gradient based variance reduction techniques (see e.g. [20]) impossible in this case.

3. Martingale representation

In this section we provide a general discrete-time martingale representation for Markov chains of the type (1) which is used later to construct an efficient variance reduction algorithm. Let (φk)k∈Z+ be a complete orthonormal system 2 m in L (R ,Pξ) with φ0 ≡ 1. In particular, we have

E[φi(ξ)φj(ξ)] = δij, i, j ∈ Z+ with ξ ∼ Pξ. Notice that this implies that the random variables φk(ξ), k ≥ 1, are centered. As an example, we can take multivariate Hermite polynomials for the ULA algorithm and a tensor product of shifted Legendre polynomials for ”uni- form part” and Hermite polynomials for ”Gaussian part” of the random variable ξ = (u, z)T in MALA, as the shifted Legendre polynomials are orthogonal with respect to the Lebesgue measure on [0, 1]. D. Belomestny et al/Variance reduction for MCMC algorithms 7

Let (ξp)p∈N be i.i.d. m−dimensional random vectors. We denote by (Gp)p∈N0 the filtration generated by (ξp)p∈N with the convention G0 = triv. For x ∈ d m d+m d R , y ∈ R and k ∈ N, let Φk(x, y) be a function mapping R to R . Then we set for l ≤ p x Xl,p := Gl,p(x, ξl, . . . , ξp), (5) d+m×(p−l+1) d with the functions Gl,p : R → R defined as

Gl,p(x, yl, . . . , yp) := Φp(·, yp) ◦ Φp−1(·, yp−1) ◦ · · · ◦ Φl(x, yl). (6)

x  d Note that X0,p is a Markov chain with values in R of the form (1), p∈N0 x starting at X0 = x. In the sequel we write Xp and Gp as a shorthand notation x for X0,p and G0,p, respectively. Theorem 4 For all p ∈ N, q ≤ p, j < q ≤ p, any Borel bounded functions f : Rd → R and all x ∈ Rd the following representation holds in L2(P)

∞ q x  x  X X x f(Xq ) = E f(Xq ) Gj + aq,l,k(Xl−1)φk (ξl) , (7) k=1 l=j+1

x d where (Xp )p≥0 is given in (5) and for all y ∈ R

h y i aq,l,k(y) = E f(Xl−1,q)φk (ξl) , q ≥ l, k ∈ N. (8)

p d Remark 5 Denote by F2 class of Borel functions f : R → R such that for d any n ≤ p, any indices 1 ≤ i1 < . . . < in ≤ p and y ∈ R , h 2i E f(Φin (·, ξin ) ◦ Φin−1 (·, ξin−1 ) ◦ · · · ◦ Φi1 (y, ξi1 )) < ∞ (9)

p Then the statement of Theorem4 remains valid for f ∈ F2 .

If all the functions Φl, l ≥ 1, in (5) are equal, then the condition (9) reduces  2 y  d to E f (Xq ) < ∞ for all q ≤ p and y ∈ R . Let us denote this class of functions p by F2,hom. x Corollary 6 Let (Xp )p≥0 be a homogeneous Markov chain of the form (5) with p d Φl = Φ, l ≥ 1. Then for all p ∈ N, q ≤ p, j < q ≤ p, f ∈ F2,hom and x ∈ R it holds in L2(P),

∞ q x  x  X X x f(Xq ) = E f(Xq ) Gj + a¯q−l,k(Xl−1)φk (ξl) k=1 l=j+1 where for all y ∈ Rd,

y a¯r,k(y) = E [f(Xr )φk (ξ1)] r, k ∈ N (10)

Another equivalent representation of the coefficients ap,l,k turns out to be more useful to construct : D. Belomestny et al/Variance reduction for MCMC algorithms 8

Proposition 7 Let q ≥ l, k ∈ N. Then the coefficients aq,l,k in (8) can be alternatively represented as

aq,l,k(x) = E [φk (ξ) Qq,l (Φl(x, ξ))]

h x i with Qq,l(x) = E f(Xl,q) , q ≥ l. In the homogeneous case Φl = Φ, the coeffi- cients a¯r,k in (10) are given respectively for all l ≥ 1, by

x a¯r,k(x) = E [φk (ξ) Qr (Φ(x, ξ))] with Qr(x) = E [f(Xr )] , r ∈ N. (11)

4. Variance reduction

Next we show how the representation (7) can be used to reduce the variance of MCMC algorithms. For the sake of clarity, in the sequel we consider only the time homogeneous case (Φl = Φ for all l ∈ N). Define

N+n 1 X πN (f) = f(Xx), (12) n n p p=N+1 where N ∈ N0 is the length of the burn-in period and n ∈ N is the number of effective samples. Fix some K ∈ N and denote

N+n " K p # 1 X X X M N (f) = a¯ (Xx )φ (ξ ) K,n n p−l,k l−1 k l p=N+1 k=1 l=N+1 K N+n 1 X X = A (Xx )φ (ξ ), (13) n N+n−l,k l−1 k l k=1 l=N+1 where q X Aq,k(y) = a¯r,k(y), q = 0, . . . , n − 1. (14) r=0

x Since Xl−1 is independent of ξl and E[φk(ξl)] = 0, k 6= 0, we obtain x x E[g(Xl−1)φk (ξl)] = E[g(Xl−1)E [φk (ξl) | Gl−1]] = 0

2 N for any function g ∈ L (π). Hence the r.v. MK,n(f) has zero mean and can be viewed as a control variate. The representation (8) suggests also that the variance of the estimator

N N N πK,n(f) = πn (f) − MK,n(f) (15)

0 should be small for K large enough. Indeed, since E[φk(ξl)φk0 (ξl)] = 0 if k 6= k , we obtain ∞ n 1 X X Var[πN (f)] = E[A2 (Xx )], (16) K,n n2 n−l,k N+l−1 k=K+1 l=1 D. Belomestny et al/Variance reduction for MCMC algorithms 9

N Hence Var[πK,n(f)] is small provided that the coefficients As,k decay fast enough as k → ∞. In Section5 we provide a detailed theoretical analysis of this decay for ULA (see Example1). The coefficients (¯al,k) need to be estimated before one can apply the proposed variance reduction approach. One way to estimate them is to use nonparametric regression. We present a generic regression algorithm and then in Section6 give further implementation details. Our algorithm starts with estimating the functions Qr for r = 0, . . . , n − 1, defined in (11). We first generate T paths of the chain X (the so-called “training paths”):

n (s) (s) o D = (X1 ,...,XN+n), s = 1,...,T

(s) with X0 = x, s = 1,...,T. Then we solve the least squares optimization problems

T 2 X (s) (s) Qbr ∈ arg min f(Xr+N ) − ψ(XN ) , r = 0, . . . , n − 1, (17) ψ∈Ψ s=1 where Ψ is a class of real-valued functions on Rd. We then set ( q ) Z X bar,k(x) = Qbr(Φ(x, z))φk(z)Pξ(dz), Abq,k(x) = min Wk, bar,k(x) , r=0 (18) where (Wk) is a sequence of truncation levels. Finally, we construct the empirical N estimate of MK,n(f) in the form

K N+n N 1 X X x Mc (f) := AbN+n−l,k(X )φk(ξl) . K,n n l−1 k=1 l=N+1

N N Obviously, E[McK,n(f)|D] = 0 and McK,n(f) is a valid control variate in that it does not introduce any bias. By the Jensen inequality and orthonormality of (φk)k≥0,

 2  N N E Mc (f) − M (f) D K,n K,n K n 1 X X h x x 2 i ≤ E |An−l,k(X ) − Abn−l,k(X )| D . n2 N+l−1 N+l−1 k=1 l=1 Set

N N N πbK,n(f) := πn (f) − McK,n(f). D. Belomestny et al/Variance reduction for MCMC algorithms 10

Combining this with (16) results in

∞ n 1 X X Var[πN (f)|D] = E[A2 (Xx )] bK,n n2 n−l,k N+l−1 k=K+1 l=1 K n 1 X X h x x 2 i + E |An−l,k(X ) − Abn−l,k(X )| D . (19) n2 N+l−1 N+l−1 k=1 l=1 In the next section we analyze the case of ULA algorithm and show that under some mild conditions (Lipschitz continuity and convexity outside a ball of the potential U),

2 x 2 E[Al,k(XN+l−1)] ≤ Rk, k ∈ N, l = 0, . . . , n − 1,

P∞ 2 for a sequence (Rk) satisfying k=1 Rk ≤ C with constant C not depending on γ and n (see (29)). Hence if we take Wk = Rk as truncation parameters in (18), then Abn−l,k converges to

2 A¯q,k := arg min E[|Aq,k(XN ) − ψ(XN )| ] ψ∈Ψ as T → ∞, see Section 11 in [14]. Moreover, it follows from (19) that

2C Var[πN (f)|D] ≤ . bK,n n

If the class of functions Ψ is a linear one of dimension DΨ, then the cost of computing the coefficients Abl,k for all l = 1, . . . , n, and k = 1,...,K, is of order 2 N DΨKT n. Given (Abl,k) the cost of evaluating πbK,n is proportional to DΨKn. At N the same time the variance of the standard estimate πn (f) is of order 1/(nγ) and this bound can not be improved, see Lemma8 and Remark9. Thus, the ratio of the corresponding cost-to-variance characteristics (product of cost and variance) can be bounded as

cost(πN )Var[πN (f)|D] bK,n bK,n 2 N N ≤ CKγ T DΨ. (20) cost(πn )Var[πn (f)] Thus, for any fixed K ≥ 1, our variance reduction method is advantageous if 2 √ γ  min{1/(DΨT ), 1/ n}. Note that in order to achieve convergence of the corresponding MSE to zero, we need anyway that γ → 0 as the invariant measure of the ULA with a constant time step γ is not equal to π.

5. Analysis of variance reduced ULA

In this section we perform the convergence analysis of the ULA algorithm. We use the notations of Example1. For the sake of clarity and notational simplicity we restrict our attention to the constant time step, that is, we take γk = γ for D. Belomestny et al/Variance reduction for MCMC algorithms 11 any k ∈ N. By Hk, k ∈ N0, we denote the normalized Hermite polynomial on R, that is, k k (−1) x2/2 ∂ −x2/2 Hk(x) := √ e e , x ∈ R. k! ∂xk d For a multi-index k = (ki) ∈ N0, Hk denotes the normalized Hermite polynomial on Rd, that is, d Y d Hk(x) := Hki (xi), x = (xi) ∈ R . i=1 Pd d In what follows, we also use the notation |k| = i=1 ki for k = (ki) ∈ N0, and we set Gp = σ(Z1,...,Zp), p ∈ N, and G0 = triv. Given N and n as above, for K ∈ N, denote

n 1 X X M N (f) := A (X )H (Z ) K,n n n−l,k N+l−1 k N+l 0

q X Aq,k(y) := a¯r,k(y). (21) r=0

N N N For an estimator ρn (f) ∈ {πn (f), πK,n(f)} of π(f) (see (12) and (15)), we shall be interested in the Mean Squared Error (MSE), which can be decomposed as the sum of the squared bias and the variance:

 N  h N 2i  N 2 N MSE ρn (f) = E ρn (f) − π(f) = E[ρn (f)] − π(f) + Var[ρn (f)]. (22) Our analysis is carried out under the following two assumptions. (H1) [Lipschitz continuity] The potential U is differentiable and ∇U is Lip- schitz continuous, that is, there exists L < ∞ such that

d |∇U(x) − ∇U(y)| ≤ L|x − y|, x, y ∈ R .

(H2) [Convexity] The potential U has the form U(x) = U0(x) + ∆(x) where 2 d ∆ has a compact support and U0 ∈ C (R ) with

2 2 2 m|x| ≤ hD U0(x), xi ≤ M|x| , x ∈ R for some constants m, M > 0.

Let π be the probability measure on Rd with density π(x) of the form (3); for γ > 0, define the Markov kernel Qγ associated to one step of the ULA algorithm by Z 1  1  Q (x, A) = exp − |y − x + γ∇U(x)|2 dy (23) γ d/2 A (2πγ) 2γ D. Belomestny et al/Variance reduction for MCMC algorithms 12

d x for any A ∈ B(R ). Note that for a homogeneous Markov chain Xp of p∈N0 x n the form (2) with constant step size γ, P(Xn ∈ A) = Qγ (x, A). Due to the N martingale transform structure of MK,n(f), we have

 N  E MK,n(f) = 0.

N N Hence both estimators πn (f) and πK,n(f) have the same bias. Under assump- tions (H1) and (H2), the corresponding Markov chain has a unique stationary distribution πγ which is different from π. From [11, Theorem 10], it follows that there exist γ0 > 0 and C < ∞ such that for all γ ∈ (0, γ0], we get √ kπ − πγ kTV ≤ C γ (24)

Let us now derive an upper bound for the variance of the classical estimator (12) for ULA-based chain. Define

V (x) = 1 + |x|2 . (25)

The following lemma follows from Lemma 64 in Appendix. Lemma 8 Assume (H1) and (H2). Let f be a bounded Borel function and x (Xp )p∈N0 be a Markov chain generated by ULA with constant step size γ. Then, d there exist C < ∞ and γ0 > 0, such that for all n ∈ N, γ ∈ (0, γ0] and x ∈ R , ! 1 V (x) + 12 Var πN (f) ≤ C + . (26) n nγ nγ

Remark 9 The bound in Lemma8 is sharp and cannot be improved even in the case of a normal distribution π. Indeed, ULA for the standard normal dis- tribution takes form  γ  √ X = 1 − X + γξ , ξ ∼ N (0, 1). k+1 2 k k+1 k+1

Thus, setting X0 = 0, we obtain   " n # n i γ 1 X 1 X X √  γ i−j 4 16(1 − ) Var [π0 (f)] = Var X = Var γ 1 − ξ ≥ − 2 . 0 n 0 n i n2  2 j nγ n2γ2 i=1 i=1 j=1

One of the main results of this paper is the following upper bound for the functions (Aq,k) in (21). Theorem 10 Fix some K ≥ 1 and suppose that a bounded function f and µ = ∇U are K × d times continuously differentiable and for all x ∈ Rd and k satisfying 0 < kkk ≤ K,

k k |∂ f(x)| ≤ Bf , |∂ µ(x)| ≤ Bµ . (27) D. Belomestny et al/Variance reduction for MCMC algorithms 13

Moreover assume that

∞ p X Y 2 x max γ (I − γD U(Xk−1)) ≤ Ξ(x), r = 2, 4, (28) j≥N p=j+1 k=j+1 r for some Ξ(x) > 0 not depending on γ, where for any random matrix A we r 1/r define kAkr = (E|A| ) . Then, there exists a constant CK < ∞ such that

2 Aq,k(x) ≤ CK Ξ(x) , 1 ≤ |k| ≤ K, q = 0, . . . , n − 1, (29) and X X γ |I|K−1 A2 (x) ≤ C Ξ(x) , (30) q,k K 2 kkk≥K+1 I⊆{1,...,d},I6=∅ where the sum in (29) runs over all nonempty subsets I of the set {1, . . . , d}.

Corollary 11 Suppose that (H2) holds with ∆ ∈ C2(Rd) satisfying |D2∆(x)| ≤ D for all x ∈ Rd and some constant D > 0. Then for γ < 1/M we have

∞ p X Y 2 x 1 D γ (I − γD U(X )) ≤ + . (31) k−1 m m2(1 − γM) p=j+1 k=j+1 r

Hence under assumptions of Theorem 10, there exists a constant CeK < ∞ such that for all n ∈ N and x ∈ Rd,

 N  −1 K−1 Var πK,n(f) ≤ CeK n γ . (32)

Proof We have

p p Y 2  Y 2  I − γD U(Xk−1) = I − γD U0(Xk−1) + k=j+1 k=j+1 p l−1 p X Y 2  2 Y 2  γ I − γD U0(Xk−1) D ∆(Xl−1) I − γD U0(Xk−1) l=j+1 k=j+1 k=l+1 where empty products are equal 1 by definition. Hence

p   p Y 2  γ(p − j)D Y 2 I − γD U(Xk−1) ≤ 1 + I − γD U0(Xk−1) 1 − γM k=j+1 k=j+1  γ(p − j)D  ≤ 1 + (1 − γm)p−j 1 − γM and (31) follows.  D. Belomestny et al/Variance reduction for MCMC algorithms 14

Let us sketch the main steps of the proof. First using integration by parts, we prove that

p 0 |k0|/2 (k − k )! h k0 i Aq,k(x) = γ √ E ∂ F (x, Z1,...,Zq)Hk−k0 (Z1) , (33) k! Z1

0 where F (x, Z ,...,Z ) := Pq f(Xx) and ∂k stands for a weak partial q 1 q r=0 r Z1 derivative of the functional Fs that also can be viewed as discretised version of Malliavin derivative. Now by taking k0 = k, we get

q ! X A2 (x) ≤ γ|k|Var ∂k f Xx , q = 0, . . . , n − 1. q,k Z1 p p=0

Also from (33) we can derive that

q ! X X γ |I|K X A2 (x) ≤ Var ∂KI f Xx q,k 2 Z1 p kkk≥K+1 I⊆{1,...,d},I6=∅ p=0 where KI = K(1I (1),..., 1I (d)). Finally, using the Gaussian Poincare inequality, we show that under assumptions (27) and (28) it holds

q ! X Var ∂k f Xx ≤ C γ−1Ξ(x), q = 0, . . . , n − 1, (34) Z1 p K p=1 for any kkk ≤ K and some positive constants κ,CK not depending on n and γ.

6. Numerical analysis

In this section we illustrate the performance of the proposed variance reduced ULA algorithm for various potential functions U. In our we implementation we follow the next steps. First we construct a polynomial approximation for each Qr in the form:

X s d Qbr(x) = βbsx , s = (s1, . . . , sd) ∈ N0. ksk≤m

The coefficients βbs ∈ R are obtained via a modified least-squares criteria based  (s) (s)  T on T independent training trajectories X1 ,...,XN+n s=1. More precisely we define Qb0(x) = f(x) and for 1 ≤ r ≤ n − 1, we set

T 2 X (s) (s) Qbr = arg min f(XN+r) − ψ(XN ) , (35) ψ∈Ψm s=1 D. Belomestny et al/Variance reduction for MCMC algorithms 15

P s where Ψm is a class of polynomials ψ(x) = ksk≤m αsx with αs ∈ R. Then using the identity

bj/2c j X 1 ξ = j! Hj−2r(ξ), ξ ∈ , r p R r=0 2 r! (j − 2r)! we obtain a closed-form expression for the estimates bar,k of functionsa ¯r,k in (11). Namely, for all x = (x1, . . . , xd) ∈ Rd and µ = ∇U, we have Z √ bar,k(x) = Hk(z)Qbr(x − γµ(x) + γz)ϕ(dz) d X Y i = βs Pki,si (x − γµi(x)), (36) ksk≤m i=1 where for any integer k, s Pk,s is a one-dimensional polynomial of degree at most s with analytically known coefficients. We estimate Qr only for r < ntrunc where the truncation level ntrunc may depend on d and γ. It allows us to use a smaller amount of training trajectories to approximate Qr(x). Finally, we construct a truncated version of the estimator (15):

πN (f) = πN (f) − M N (f) K,n,ntrunc n cK,n,ntrunc where

N+n  p  N 1 X X X 1 McK,n,n (f) =  ap−l,k(Xl−1)Hk(ξl) {|p − l| < ntrunc} . trunc n b p=N+1 0

6.1. Comparison with vanilla ULA

In this subsection we aim at comparing the degree of variance reduction relative to costs, achieved by the proposed algorithm as compared to the vanilla ULA algorithm. We consider samples, generated by ULA with π being either the - dard normal distribution in dimension d or the mixture of two d−dimensional standard Gaussians of the form

1  2 2  π(x) = e−(1/2)kx−µk + e−(1/2)kx+µk 2p(2π)d where d = 2 and µ = (0.5, 0.5). For both examples, our goal is to estimate π(f) Pd Pd 2 with f(x) = i=1 xi and f(x) = i=1 xi . We used a constant step size γ = 0.1 and sampled T = 5 × 104 independent training trajectories, each one with the burn-in period nburn = 100. Then we solve the least squares problems (35) using Pd the first order polynomial approximations for f(x) = i=1 xi and second order Pd 2 polynomial approximations for f(x) = i=1 xi as described in the previous D. Belomestny et al/Variance reduction for MCMC algorithms 16

Pd Pd 2 section. We set K = 1 or K = 2 for f(x) = i=1 xi or f(x) = i=1 xi , respectively. Then we estimate the cost-to-variance ratio (degree of variance reduction relative to costs)

cost(πN )Var[πN (f)] R(K, N, n, n ) = n n (37) trunc cost(πN )Var[πN (f)|D] bK,n bK,n,ntrunc by its empirical counterpart, computed over 24 independent trajectories with 3 3 N n = 2 × 10 and N = 2 × 10 . Since for a fixed K the cost of computing πn (f) is proportional to the cost of computing function f, we set for K = 1

N N cost(πbK,n) = cost(πn ) × ntrunc × 2, since there are only 2 non-zero functions among ar,k for a fixed r. Note that in this case each ar,k is a linear function, which can be computed at the same cost as f. Similarly, for K = 2 we set

N N cost(πbK,n) = cost(πn ) × ntrunc × 8, since there are at most 8 non-zero functions among ar,k for a fixed r. Variance reduction costs for Gaussian potential and different ntrunc are summarized in Figure1, and for the Gaussian mixture - in Figure2. Note that for both examples the proposed algorithm allows us to obtain a significant gain in efficiency.

Pd Pd 2 (a) f(x) = i=1 xi (b) f(x) = i=1 xi

Fig 1: Cost-to-variance ratios (37) as functions of the truncation level ntrunc for a two-dimensional standard Gaussian distribution and different test functions.

6.2. Gaussian mixtures

We consider ULA with π given by the mixture of two equally-weighted d- dimensional Gaussian distributions of the following form

1  T −1 T −1  π(x) = e−(1/2)(x−µ) Σ (x−µ) + e−(1/2)(x+µ) Σ (x+µ) (38) 2p(2π)d|Σ| D. Belomestny et al/Variance reduction for MCMC algorithms 17

Pd Pd 2 (a) f(x) = i=1 xi (b) f(x) = i=1 xi

Fig 2: Cost-to-variance ratios (37) as functions of the truncation level ntrunc for the mixture (38) of two-dimensional Gaussian distributions and two test functions. where µ ∈ Rd, Σ is a positive-definite d × d matrix and |Σ| is its determinant. The function U(x) and its gradient are given by

1  > −1  U(x) = (x − µ)T Σ−1(x − µ) − ln 1 + e−2µ Σ x 2 and  > −1 −1 ∇U(x) = Σ−1(x − µ) + 2 1 + e2µ Σ x Σ−1µ, respectively. In our experiments we considered d = 2, µ = (0.5, 0.5) and ran- domly generated positive-definite matrix Σ (heterogeneous structure). In order Pd Pd 2 to approximate the expectation π(f) with f(x) = i=1 xi or f(x) = i=1 xi , we used a constant step size γ = 0.1 and sampled T = 5×104 independent train- ing trajectories, each one of size n = 50 with the burn-in period nburn = 100. Then we solve the least squares problems (35) using the first order polyno- Pd mial approximations for f(x) = i=1 xi and second order approximations for Pd 2 f(x) = i=1 xi as described in the previous section. Also we set K = 1 for Pd Pd 2 f(x) = i=1 xi or K = 2 for f(x) = i=1 xi . We set the truncation level ntrunc = 50. To test our variance reduction algorithm, we generated 100 inde- pendent trajectories of length n = 2×103 and plot boxplots of the corresponding ergodic averages in Figure3. Our approach (MDCV, Martingale decomposition control variates) is compared to other variance reduction methods of [20] and [2]. In the baselines we used the first order polynomials in case of K = 1 and second order polynomials for K = 2.

6.3. Banana shape distribution

The Banana-shape distribution, proposed by [15], can be obtained from a d- dimensional Gaussian vector with zero mean and covariance diag(p, 1,..., 1) by D. Belomestny et al/Variance reduction for MCMC algorithms 18

Pd Pd 2 (a) f(x) = i=1 xi (b) f(x) = i=1 xi Fig 3: Boxplots of ergodic averages obtained by the variance reduced ULA al- gorithm for the Gaussian mixture model. The compared estimators are the ordinary empirical average (Vanilla), our method of martingale control variates (MDCV), zero variance estimator (ZV) and the control variates approach based on the empirical spectral variance minimisation (ESVM)

d d applying transformation ϕb(x): R → R of the form

2 ϕb(x1, . . . , xn) = (x1, x2 + bx1 − pb, x3, . . . , xd) where p > 0 and b > 0 are parameters; b accounts for the curvature of densitys level sets. The potential U is given by

2 2 2 d x (x2 + bx − pb) 1 X U(x , . . . , x ) = 1 + 1 + x2 . 1 d 2 2 2 k k=3

The quantity of interest is the expectation of f(x) = x2. We set p = 100, b = 0.1 and consider d = 8. We solved the least squares problems (35) using the second-order polynomial approximations for the coefficientsa ˆp,k as described in the previous section, we also set K = 2. We used T = 5 × 104 independent training trajectories, each of size n = 50 with the burn-in nburn = 100. We set the truncation level ntrunc = 50. To test our variance reduction algorithm, we generated 100 independent trajectories of length n = 2 × 103 and display boxplots of the ergodic averages in Figure4.

6.4. Binary Logistic Regression

m In this section, we consider logistic regression. Let Y = (Y1,..., Yn) ∈ {0, 1} be binary response variables, X ∈ Rm×d be a feature matrix and θ ∈ Rd - vector of regression parameters. We define log-likelihood of i-th observation as

T T Xi θ `(Yi|θ, Xi) = YiXi θ − ln (1 + e ) D. Belomestny et al/Variance reduction for MCMC algorithms 19

Fig 4: Boxplots of ergodic averages from the variance reduced ULA algorithms for the Banana shape density. The compared estimators are the ordinary empir- ical average (Vanilla), our estimator of control variates (MDCV), zero variance estimator (ZV) control variates obtained with empirical spectral variance min- imisation (ESVM).

In order to estimate θ according to given data, the Bayesian approach introduces prior distribution π0(θ) and consider the posterior density π(θ|Y, X) using Bayes’ rule. 2 In the case of Gaussian prior π0(θ) ∼ N (0, σ Id), the unnormalized posterior density takes the form:

( m ) X 1 2 π(θ|Y, X) ∝ exp YT Xθ − ln 1 + eXiθ − kθk , 2σ2 2 i=1 Thus we obtain m X  T  1 2 U(θ) = −YT Xθ + ln 1 + eθ Xi + kθk , 2σ2 2 i=1 m T X Xi 1 ∇U(θ) = −Y X + T + θ. 1 + e−θ Xi σ2 i=1 To demonstrate the performance of the proposed control variates approach in the above Bayesian logistic regression model, we take a simple dataset from [20], which contains the measurements of four variables on m = 200 Swiss banknotes. Prior distribution of the regression parameter θ = (θ1, θ2, θ3, θ4) assumed to 2 2 be normal with the covariance matrix σ I4, where σ = 100. To construct trajectories of length n = 1 × 104, we take step size γ = 0.1 for the ULA scheme with N = 103 burn-in steps. As in the previous experiment we used the first order polynomials approximations to analytically compute the coefficients aˆp−l,k, based on T = 10 training trajectories. We put ntrunc = 50 and K = 1. D. Belomestny et al/Variance reduction for MCMC algorithms 20

Fig 5: Boxplots of ergodic averages from the variance reduced ULA algorithms for the Logistic Regression. The compared estimators are the ordinary empiri- cal average (Vanilla), our estimator of control variates (MDCV), zero variance estimator (ZV) control variates obtained with empirical spectral variance min- imisation (ESVM).

Pd The target function is taken to be f(θ) = i=1 θi. In order to test our variance reduction algorithm, we generate 100 independent test trajectories of length n = 103. In Figure5 we compare our approach to the variance reduction methods of [20] and [2]. In the baselines we use the first order polynomials for the same reasons as in the Gaussian mixtures example.

7. Proofs

7.1. Proof of Theorem4

The expansion obviously holds for p = q = 1 and j = 0. Indeed, since (φk)k≥0 2 d 2 is a complete orthonormal system in L (R ,Pξ), it holds in L (P) that

x x X f(X1 ) = E[f(X1 )] + a1,1,k(x)φk(ξ1) k≥1

x for any bounded f with a1,1,k(x) = E[f(X1 )φk(ξ1)]. Assume now that (7) holds for some m < p, any q ≤ m, j < q ≤ m and all bounded f. Let us prove that the induction assumption holds for q = m+1 and all j < q. The orthonormality and ∞ d completeness of the system (φk)k=0 implies that for any bounded f and y ∈ R , y 2 limn→∞ Ψn,m+1,1(y) = 0 with Ψn,m+1,1(y) = E[|f(Xm,m+1)−fn,m+1,1(y)| ] and

n y X y fn,m+1,1(y) = E[f(Xm,m+1)] + E[f(Xm,m+1)φk(ξm+1)]φk(ξm+1). k=1 D. Belomestny et al/Variance reduction for MCMC algorithms 21

d Note that for any y ∈ R and n ∈ N it holds Ψn,m+1,1(y) ≤ Ψ0,m+1,1(y) and h h Xx Xx ii x x m m 2 x E [ψ0,m+1,1(Xm)] = E [E [ψ0,m+1,1(Xm)| Gm]] = E E |f(Xm,m+1) − Ef(Xm,m+1)| Xm  x x 2 x = E |f(Xm+1) − Ef(Xm+1)| = Varf(Xm+1) < ∞ x Hence, by Lebesgue dominated convergence theorem, limn→∞ E [ψn,m+1,1(Xm)] = d y  y  0 and since for all y ∈ R the expectation E[f(Xm,m+1)] is a version of E f(Xm,m+1)|Gm , it holds in L2(P) that ∞ x x X x f(Xm+1) = E[f(Xm+1)|Gm] + am+1,m+1,k(Xm)φk(ξm+1) (39) k=1 where  y  am+1,m+1,k(y) = E f(Xm,m+1)φk (ξm+1) which is the required statement in the case q = m + 1 and j = m. Now assume  y  x that j < m. Set g(y) = E f(Xm,m+1) . Note that P-a.s. it holds g(Xm) = x E [f(Xm)| Gm] and g is bounded by construction. Hence the induction hypothesis applies, and we get m  x   x  X X x E f(Xm+1) Gm = E f(Xm+1) Gj + am+1,l,k(Xl−1)φk(ξl) (40) k≥1 l=j+1 with x   x    x x  am+1,l,k(Xl−1) = E E f(Xm+1) Gm φk(ξl) Gl−1 = E f(Xm+1)φk(ξl) Xl−1 . where for y ∈ Rd, h y i am+1,l,k(y) = E f(Xl−1,m+1)φk(ξl) Formulas (39) and (40) conclude the induction step for q = m + 1 and all j < q and hence the proof.

7.2. Proof of Lemma8

In the sequel, the notation A(γ, n, x) . B(γ, n, x) means that there exist γ0 > 0, d and C < ∞ such that for all γ ∈ (0, γ0], x ∈ R , n ∈ N, A(γ, n, x) ≤ CB(γ, n, x). x Under assumptions (H1) and (H2) the Markov chain (Xp )p∈N0 generated by ULA is V −geometrically ergodic in the sense of (58) with V (x) = 1 + |x|2 due to Lemma 19. It follows from Lemma 22, applied with ργ that 2 1 ργN V (x) + 1 Var[πN (f)] + n . n(1 − ργ/2) n(1 − ργ/2) Since 1 − ργ/2 ≥ Cγ with a constant C not depending on γ, N and n, we end up with an estimate 2 1 ρN V (x) + 1 Var[πN (f)] + . n . nγ nγ D. Belomestny et al/Variance reduction for MCMC algorithms 22

7.3. Proof of Theorem 10

For l ≤ p and x ∈ Rd, we have the representation

x √ √ Xp = Gp(x, γZ1,..., γZp),

d×(p+1) d where the function Gp : R → R is defined as

Gp(x, y1, . . . , yp) := Φ(·, yp) ◦ Φ(·, yp−1) ◦ ... ◦ Φ(x, y1) (41) with, for x, y ∈ Rd, Φ(x, y) = x − γµ(x) + y. As a consequence, for a function f : Rd → R as in Section2, we have √ √ f (Xp) = [f ◦ Gp](X0, γZ1,..., γZp).

d In what follows, for k ∈ N0, we use the shorthand notation

k k √ √ ∂1 f (Xp) := ∂1 [f ◦ Gp](X0, γZ1,..., γZp) (42) whenever the function f ◦ Gp is smooth enough (that is, f and µ need to be d smooth enough). Finally, for a multi-index k = (ki) ∈ N0, we use the notation k! := k1! · ... · kd! 0 d 0 0 Lemma 12 For any k, k ∈ N0 such that k ≤ k componentwise and kk k ≤ K, the following representation holds

 0 1/2 |k0| (k − k )! h k0 x i a¯ (x) = γ E ∂ f(X )H 0 (Z ) , p,k k! 1 p k−k 1 where a¯p,k is defined in (11). d d Proof Note that for the normalized Hermite polynomial Hk on R , k ∈ N0, it holds |k| (−1) k Hk(z)ϕ(z) = √ ∂ ϕ(z). k! This enables to use the integration by parts in vector form as follows (below Qp j=l+1 := 1 whenever l = p)

p Z Z √ √ Y a¯p,k(x) = ... f ◦ Gp(x, γz1,..., γzp)Hk(z1)ϕ(z1) ϕ(zj) dz1 ... dzp d d R R j=2 Z Z p 1 √ √ |k| k Y = √ ... f ◦ Gp(x, γz1,..., γzp)(−1) ∂ ϕ(z1) ϕ(zj) dz1 ... dzp d d k! R R j=2 |k0|/2 Z Z p γ k0 √ √ |k−k0| k−k0 Y = √ ... ∂1 [f ◦ Gp](x, γz1,..., γzp)(−1) ∂ ϕ(z1) ϕ(zj) dz1 ... dzp d d k! R R j=2 p 0 |k0|/2 (k − k )! h k0 √ √ i = γ √ E ∂ [f ◦ Gp](x, γZ1,..., γZp)Hk−k0 (Z1) . k! y1 D. Belomestny et al/Variance reduction for MCMC algorithms 23

The last expression yields the result. 

0 d 0 0 0 For multi-indices k, k ∈ N0 with k ≤ k componentwise and k 6= k, kk k ≤ K, we get applying first Lemma 12,

1/2  0  s |k0| (k − k )! hX k0 x k0 x i As,k(x) = γ E {∂1 f(Xr ) − E[∂1 f(Xr )]}Hk−k0 (Z1) k! r=1 where As,k is defined in (14). Assume that µ and f are K ×d times continuously d 0 0 1 1 differentiable. Then, given k ∈ N0, by taking k = k (k) = K( {k1>K} ..., {kd>K}), we get

 0  X X 0 (k − k )! A2 (x) = γ|k | Q (k0, k − k0) s,k k! s k : kkk≥K+1 k : kkk≥K+1        X |I|K X mI !   X  = γ Qs(KI , mI + mIc ) , (mI + KI )! I⊆{1,...,d},I6=∅ d   c d c  mI ∈NI mI ∈N0,Ic , kmI k≤K  (43)

d where for any two multi-indices r, q from N0

( " s #)2 X  r x  r x Qs(r, q) = E ∂1f Xp − E ∂1f Xp Hq(Z1) . p=1

In (43) the first sum runs over all nonempty subsets I of the set {1, . . . , d}. For d any subset I, NI stands for a set of multi-indices mI with elements mi = 0, c d i 6∈ I, and mi ∈ N, i ∈ I. Moreover, I = {1, . . . , d}\I and N0,Ic stands for a set of multi-indices mIc with elements mi = 0, i ∈ I, and mi ∈ N0, i 6∈ I. Finally, the multi-index KI is defined as KI = (K1{1∈I},...,K1{d∈I}). Applying the estimate m ! I ≤ (1/2)|I|K , (mI + KI )! we get

X 2 X |I|K As,k(x) ≤ (γ/2) (44) k : kkk≥K+1 I⊆{1,...,d},I6=∅ X X × Q(KI , mI + mIc )

d c d c mI ∈NI mI ∈N0,Ic , kmI k≤K X |I|K X ≤ (γ/2) Q(KI , m). I⊆{1,...,d},I6=∅ d m∈N0 D. Belomestny et al/Variance reduction for MCMC algorithms 24

The Parseval identity implies that for any function ϕ : Rd → R satisfying 2 E[ϕ (Z1)] < ∞,

X 2 2 {E[ϕ(Z1)Hm(Z1)]} ≤ E[{ϕ(Z1)} ] d m∈N0 Using this identity in (44) implies

s ! X X γ |I|K X A2 (x) ≤ Var ∂KI f Xx s,k 2 1 p k : kkk≥K+1 I⊆{1,...,d},I6=∅ p=1 Next we show that under the conditions of Theorem 10 q ! X KI x −1 Var ∂1 f Xp ≤ Cγ Ξ(x), q = 1, . . . , n, p=1 for all x and some constants C, κ > 0 not depending on n and γ. To keep the notational burden at a reasonable level, we present the proof only in one- dimensional case. Multidimensional extension is straightforward but requires involved notations. First, we need to prove several auxiliary results.

Lemma 13 Let (xp)p∈N0 and (p)p∈N be sequences of nonnegative real numbers satisfying x0 = C0 and

0 ≤ xp ≤ αpxp−1 + γp, p ∈ N, (45) p Y 2 0 ≤ p ≤ C1 αk, p ∈ N, (46) k=1 where αp, γ ∈ (0, 1), p ∈ N, and C0, C1 are some nonnegative constants. Assume

∞ r X Y γ αk ≤ C2 (47) r=1 k=1 for some constant C2. Then p Y xp ≤ (C0 + C1C2) αk, p ∈ N. k=1 Proof Applying (45) recursively, we get

p p p Y X Y xp ≤ C0 αk + γ r αk, k=1 r=1 k=r+1 Qp where we use the convention k=p+1 := 1. Substituting estimate (46) into the right-hand side, we obtain

p r ! p X Y Y xp ≤ C0 + C1γ αk αk, r=1 k=1 k=1 D. Belomestny et al/Variance reduction for MCMC algorithms 25 which, together with (47), completes the proof. 

In what follows, we use the notation

0 x αk = 1 − γµ (Xk−1), k ∈ N. (48) 0 The assumption (27) implies that |µ (x)| ≤ Bµ for some constant Bµ > 0 and d all x ∈ R . Without loss of generality we suppose that γBµ < 1. Lemma 14 Under assumptions of Theorem 10, for all natural r ≤ K and l ≤ p, there exist constants Cr (not depending on l and p) such that p Y ∂r Xx ≤ C (1 − γµ0(Xx )) a.s. (49) yl p r k−1 k=l+1 where ∂r Xx is defined in (42). Moreover, we can choose C = 1. yl p 1 Lemma 15 Under assumptions of Theorem 10, for all natural r ≤ K, j ≥ l and p > j, we have p Y ∂ ∂r Xx ≤ c (1 − γµ0(Xx )), a.s. (50) yj yl p r k−1 k=l+1 with some constants cr not depending on j, l and p, while, for p ≤ j, it holds ∂ ∂r Xx = 0. yj yl p Proof The last statement is straightforward. We fix natural numbers j ≥ l and prove (50) for all p > j by induction in r. First, for p > j, we write x  0 x  x ∂yl Xp = 1 − γµ (Xp−1) ∂yl Xp−1 and differentiate this identity with respect to yj x  0 x  x 00 x x x ∂yj ∂yl Xp = 1 − γµ (Xp−1) ∂yj ∂yl Xp−1 − γµ (Xp−1)∂yj Xp−1∂yl Xp−1. By Lemma 14, we have

p−1 p−1 x x Y Y |∂yj ∂yl Xp | ≤ αp|∂yj ∂yl Xp−1| + γBµ αk αk k=l+1 k=j+1 j p x Y Y 2 ≤ αp|∂yj ∂yl Xp−1| + γconst αk αk, p ≥ j + 1, k=l+1 k=j+1

x with a suitable constant. By Lemma 13 applied to bound |∂yj ∂yl Xp | for p ≥ x j + 1 (notice that ∂yj ∂yl Xj = 0, that is, C0 in Lemma 13 is zero, while C1 Qj in Lemma 13 has the form const k=l+1 αk), we obtain (50) for r = 1. The induction hypothesis is now that the inequality p Y ∂ ∂k Xx ≤ c α (51) yj yl p k s s=l+1 D. Belomestny et al/Variance reduction for MCMC algorithms 26 holds for all natural k < r (≤ K) and p > j. We need to show (51) for k = r. Fa`adi Bruno’s formula implies for 2 ≤ r ≤ K and p > l ∂r Xx = 1 − γµ0(Xx ) ∂r Xx (52) yl p p−1 yl p−1 r−1 !mk r! ∂k Xx X (m1+...+mr−1) x Y yl p−1 − γ µ (Xp−1) , m1! . . . mr−1! k! k=1 where the sum is taken over all (r−1)-tuples of nonnegative integers (m1, . . . , mr−1) satisfying the constraint

1 · m1 + 2 · m2 + ... + (r − 1) · mr−1 = r. (53) Notice that we work with (r − 1)-tuples rather than with r-tuples because the term containing ∂r Xx on the right-hand side of (52) is listed separately. For yl p−1 p > j, we then have ∂ ∂r Xx = 1 − γ µ0(Xx ) ∂ ∂r Xx − γµ00(Xx )∂r Xx ∂ Xx (54) yj yl p p p−1 yj yl p−1 p−1 yl p−1 yj p−1 r−1 !mk r! ∂k Xx X (m1+...+mr−1+1) x x Y yl p−1 − γ µ (Xp−1)∂yj Xp−1 m1! . . . mr−1! k! k=1 "r−1 !mk # r! ∂k Xx X (m1+...+mr−1) x Y yl p−1 − γ µ (Xp−1)∂yj m1! . . . mr−1! k! k=1 = 1 − γµ0(Xx ) ∂ ∂r Xx + γ , p−1 yj yl p−1 l,j,p where the last equality defines the quantity l,j,p. Furthermore, "r−1 !mk # r−1 ∂k Xx ∂s Xx ms−1 Y yl p−1 X ms yl p−1 s x ∂y = ∂y ∂ X j k! s! s! j yl p−1 k=1 s=1 m k x ! k Y ∂y Xp−1 × l . k! k≤r−1, k6=s

Using Lemma 14, induction hypothesis (51) and the fact that m1+...+mr−1 ≥ 2 for (r − 1)-tuples of nonnegative integers satisfying (53), we can bound |l,j,p| as follows p−1 p−1  p−1  Y Y X r! Y |l,j,p| ≤ BµCr αk αk + Bµ  αk m1! . . . mr−1! k=l+1 k=j+1 k=j+1 r−1 Qp−1 !ms Y Cs αk × k=l+1 s! s=1 r−1 Qp−1 !mt−1 " p−1 # X r! X mt Ct k=l+1 αk Y + Bµ ct αk m1! . . . mr−1! t! t! t=1 k=l+1 Qp−1 !ms j p Y Cs αk Y Y × k=l+1 ≤ const α α2 s! k k s≤r−1, s6=t k=l+1 k=j+1 D. Belomestny et al/Variance reduction for MCMC algorithms 27 with some constant “const” depending on Bµ, r, C1,...,Cr, c1, . . . , cr−1. Thus, (54) now implies

j p Y Y |∂ ∂r Xx| ≤ α |∂ ∂r Xx | + γ const α α2, p ≥ j + 1. yj yl p p yj yl p−1 k k k=l+1 k=j+1 We can again apply Lemma 13 to bound |∂ ∂r Xx| for p ≥ j + 1 (notice that yj yl p ∂ ∂r Xx = 0, that is, C in Lemma 13 is zero, while C in Lemma 13 has the yj yl j 0 1 Qj form const k=l+1 αk), and we obtain (51) for k = r. This concludes the proof. 

Lemma 16 Under assumptions of Theorem 10, it holds

" q # X Var ∂K f Xx ≤ C γ−1Ξ(x), q = 1, . . . , n, (55) y1 p K p=1 where CK is a constant not depending on x, n and γ. Proof The expression Pq ∂K f(Xx) can be viewed as a deterministic function p=1 y1 p of x, Z1,Z2,...,Zq

q X ∂K f(Xx) = F (x, Z ,Z ,...,Z ). y1 p 1 2 q p=1 By the (conditional) Gaussian Poincar´einequality, we have

" q # X Var ∂K f Xx ≤ E |∇ F (x, Z ,Z ,...,Z )|2 , y1 p x Z 1 2 q p=1 √ where ∇Z F = (∂Z1 F, . . . , ∂Zq F ). Notice that ∂Zj F = γ ∂yj F and hence

" q # q  q !2 X X X Var ∂K f Xx ≤ γ2 E ∂ ∂K f Xx . y1 p  yj y1 p  p=1 j=1 p=1

It is straightforward to check that ∂ ∂K f(Xx) = 0 whenever p < j. Therefore, yj y1 p we get

 2 " q # q  q  X K x 2 X X K x Var ∂ f X ≤ γ E  ∂y ∂ f X  . (56) y1 p  j y1 p   p=1 j=1 p=j

Now fix p and j, p ≥ j, in {1, . . . , q}. By Fa`adi Bruno’s formula

m K k x ! k X K! Y ∂y Xp ∂K f Xx = f (m1+...+mK )(Xx) 1 , y1 p p m1! . . . mK ! k! k=1 D. Belomestny et al/Variance reduction for MCMC algorithms 28 where the sum is taken over all K-tuples of nonnegative integers (m1, . . . , mK ) satisfying 1 · m1 + 2 · m2 + ... + K · mK = K. Then m K k x ! k X K! Y ∂y Xp ∂ ∂K f Xx = f (m1+...+mK +1)(Xx) ∂ Xx 1 yj y1 p p yj p m1! . . . mK ! k! k=1 K  s x ms−1 X K! X ms ∂y Xp + f (m1+...+mK )(Xx) 1 m ! . . . m ! p s! s! 1 K s=1 !mk ∂k Xx  s x Y y1 p × ∂y ∂ X . j y1 p k! k≤K, k6=s

Using the bounds of Lemmas 14 and 15, we obtain

p Y ∂ ∂K f Xx ≤ A α (57) yj y1 p K k k=2 with a suitable constant AK . Substituting this in (56), we proceed as follows

2 " q # q  q p  X X X Y Var ∂K f Xx ≤ γ2A2 E α x y1 p K  k p=1 j=1 p=j k=2  2 2 2 q q+1 p γ AK X X Y ≤ 2 E  αk (1 − γBµ) j=1 p=j+1 k=2  2 2 2 q j q+1 p γ AK X Y X Y ≤ 3 E αk  αk (1 − γBµ) j=1 k=1 p=j+1 k=j+1

p 1 Now, from the H¨olderinequality, we obtain (with kXkp = (EX ) p )

  2 2 q j q+1 p q j q+1 p X Y X Y X Y X Y E  αk  αk  ≤ αk αk   j=1 k=l p=j+1 k=j+1 j=1 k=1 p=j+1 k=j+1 2 4  2 q j q+1 p X Y X Y ≤ αk  αk  .

j=1 k=1 p=j+1 k=j+1 2 4

Now assumption (28) implies (55).  D. Belomestny et al/Variance reduction for MCMC algorithms 29

Appendix A: Geometric ergodicity of the Unadjusted Langevin Algorithm

Definition 1 Let (Xp)p∈N0 be a Markov chain taking values in some space X with the Markov kernel Q and stationary distribution π. We say that (Xp)p∈N0 is V −geometrically ergodic for a given function V : X → [1; +∞) if there exist real numbers C > 0 and 0 < ρ < 1 such that for any n ∈ N, n n dV (δxQ , π) ≤ Cρ V (x) (58)

In the sequel, without loss of generality, we assume that (Xp)p∈N0 follows ULA dynamics (2) with ∇U(0) = 0 and set

V (x) = 1 + |x|2 . (59)

Lemma 17 Assume (H1) and (H2). Then there exists K2 ≥ 0 such that for 2 any |x| ≥ K2 it holds h∇U(x), xi ≥ (m/2)|x| . Proof Suppose that the function ∆ in (H2) is supported in the ball {x : |x| ≤ K1}, then we have

Z K1/|x| Z 1 h∇U(x), xi = D2U(tx)[x⊗2]dt + D2U(tx)[x⊗2]dt 0 K1/|x| 2 ≥ m|x| {1 − K1(1 + L/m)/|x|} , for all |x| ≥ K1, which proves the statement with K2 = 2K1(1 + L/m). 

Lemma 18 Assume (H1) and (H2). Then, for any γ ∈ (0, γ¯] with γ = m/2L2 the kernel Qγ from (23) satisfies drift condition

γ Qγ V (x) ≤ λ V (x) + γb (60)

m  2 with λ = exp − 2 , b = 2K2 (L + m) + d + m with K2 from Lemma 17, and m from (H2). Proof Note first that

2 Qγ V (x) = 1 + |x − γ∇U(x)| + γd

Let us first consider the case x∈ / B(0,K2). It remains to notice that, by Lemma 17, we get

|x − γ∇U(x)|2 = |x|2 − 2γh∇U(x), xi + γ2|∇U(x)|2 ≤ 1 − γm + γ2L2 |x|2

m Since γ < 2L2 , we obtain  γm Q V (x) ≤ (1−γm+γ2L2)V (x)+γ(d+m−γL2) ≤ exp − V (x)+γ(d+m) γ 2 D. Belomestny et al/Variance reduction for MCMC algorithms 30

2 Consider now the case x ∈ B(0,K2). Then simply using |x − γ∇U(x)| ≤ (1 + Lγ)2|x|2, we obtain

2 2 2 2 2 Qγ V (x) ≤ (1 − γm + γ L )V (x) + γ (m − γL )(1 + |x| ) + d + L(2 + Lγ)|x| ≤  γm ≤ exp − V (x) + γ(2K2(L + m) + d + m) 2 2 

It is known that under assumption (H1) the Markov chain generated by ULA with constant step size γ would have unique stationary distribution πγ , which is different from π. Yet this chain will be V −geometrically ergodic due to [10, Theorem 19.4.1]. Namely, the following lemma holds: Lemma 19 Assume (H1) and (H2). Then for 0 < γ < γ = min(m/2L2, 1/L), and for any x ∈ Rd it holds n γn dV (δxQγ , πγ ) ≤ Cρ V (x) (61) with V (x) = 1 + |x|2 and some 0 < ρ < 1 and C > 0. Proof Note that the condition (H1) implies that for any x, y ∈ Rd, |x − y − γ(∇U(x) − ∇U(y))|2 ≤ (1 + 2Lγ + γ2)|x − y|2

Hence, [5, Corollary 5] implies that for all x, y such that |x|, |y| ≤ K,

d1/γe d1/γe kδxQγ − δyQγ kTV ≤ 2(1 − ε)

 q 1   where d·e denotes an upper integer part and ε = 2Φ −K 1 + L (1 + 2L) . Note also that the drift condition (60) implies that

γb(1 − λγ(d1/γe+1)) b Qd1/γeV (x) ≤ λV (x) + ≤ λV (x) + γ 1 − λγ λ log 1/λ

d1/γe where the last inequality is due to [11, Lemma 1]. Hence, the kernel Qγ satisfies (1, ε)-Doeblin condition on the set {x : V (x) < 1 + K2} and drift condition (60) with appropriate constants. Now the statement of the lemma follows from [10, Theorem 19.4.1]. 

Appendix B: Covariance estimation for V −geometrically ergodic Markov chains

In this section we assume that (Xp)p∈N0 is a V −geometrically ergodic Markov N chain and prove bounds on the variance of ergodic average πn (f) of the form (12). We use the same technique as in [2] to control autocovariances for a given Markov chain. We start from some auxiliary lemmas: D. Belomestny et al/Variance reduction for MCMC algorithms 31

Lemma 20 Let (Xp)p≥0 be a V −geometrically ergodic Markov chain with a stationary distribution π, and let f be a function with kfk 1 < ∞. Set f˜(x) = V 2 f(x) − π(f), then it holds for some constant C > 0,

˜ ˜ s/2 ˜ 2 |Ex[f(X0)f(Xs)]| ≤ 2Cρ kfk 1 V (x), (62) V 2

˜ ˜ s/2 ˜ 2 |Eπ[f(X0)f(Xs)]| ≤ 2Cπ(V )ρ kfk 1 . (63) V 2

Lemma 21 Let (Xp)p∈N0 be a V −geometrically ergodic Markov chain with a stationary distribution π, and let f be a function with kfk 1 < ∞. Set f˜(x) = V 2 f(x) − π(f), then

2 2 2k+s 2 k+s/2 ˜ s/2 ˜ covx [f(Xk), f(Xk+s)] ≤ C V (x)ρ +C ρ kfk 1 V (x)+Cπ(V )ρ kfk 1 V 2 V 2 (64) Proof Note that

covx [f(Xk), f(Xk+s)] = Ex [f(Xk) − π(f)] [f(Xk+s) − π(f)]

+ [π(f) − Exf(Xk+s)] [Exf(Xk) − π(f)] .

Due to V −ergodicity (58),

2 2 2k+s |[π(f) − Exf(Xk+s)] [Exf(Xk) − π(f)]| ≤ C V (x)ρ .

To bound the first term note that

|Ex [f(Xk) − π(f)] [f(Xk+s) − π(f)] − covπ [f(Xk), f(Xk+s)]| Z h i ˜ ˜ k ≤ Ey f(X0)f(Xs) |P (x, dy) − π(dy)| d R 2 s/2 ˜ 2 k 2 k+s/2 ˜ 2 ≤ 2C ρ kfk 1 kP (x, dy) − π(dy)kV ≤ 2C ρ kfk 1 V (x), V 2 V 2 where we have used Lemma 20 and ergodicity of (Xk)k≥0. This implies (64). 

Now we state and prove the main result of this section on the variance bound n for the estimator πN (f) in the case of a V −geometrically ergodic Markov chain.

Lemma 22 Let (Xp)p∈N0 be a V −geometrically ergodic Markov chain with a stationary distribution π, and let f be a function with kfk 1 < ∞. Denote V 2 f˜(x) = f(x) − π(f). Then

˜ 2  2 C1π(V )kfk 1/2 V (x) + 1 Var [πn (f)] ≤ V + C (65) x N n(1 − ρ1/2) 2 n(1 − ρ) with constants C1,C2 > 0 and ρ ∈ (0, 1) from (58). D. Belomestny et al/Variance reduction for MCMC algorithms 32

Proof Note that " N+n # N+n N+n−1 n−k−1 1 X 1 X 2 X X Var f(X ) = Var [f(X )] + cov [f(X ), f(X )] . x n k n2 x k n2 x k k+s k=N+1 k=N+1 k=N+1 s=1 | {z } | {z } S1 S2 Now we bound each summand using Lemma 21:   N 2 2 ˜ 1 ρ C V (x) + V (x)kfkV 1/2 S ≤ Cπ(V )kf˜k 1/2 + , 1 n V n2(1 − ρ)

N+n−1 " 2 2 2k+1 2 ˜ 2 k+1/2 ˜ 2 # 2 X C V (x)ρ 2C V (x)kfk 1/2 ρ 2Ckfk 1/2 π(V ) S ≤ + V + V ≤ 2 n2 1 − ρ 1 − ρ1/2 1 − ρ1/2 k=N+1 2 2N 2 2 N+1/2 ˜ 2 ˜ 2 2C ρ V (x) 4C ρ kfk 1/2 V (x) 4Ckfk 1/2 π(V ) ≤ + V + V n2(1 − ρ)2 n2(1 − ρ)(1 − ρ1/2) n(1 − ρ1/2) Hence (65) follows. 

References

[1] Roland Assaraf and Michel Caffarel. Zero-variance principle for Monte Carlo algorithms. Physical review letters, 83(23):4682, 1999. [2] D. Belomestny, L. Iosipoi, E. Moulines, A. Naumov, and S. Samsonov. Variance reduction for Markov chains with application to MCMC. arXiv preprint arXiv:1910.03643, 2019. [3] Denis Belomestny, Stefan H¨afner,and Mikhail Urusov. Variance reduction for discretised diffusions via regression. Journal of Mathematical Analysis and Applications, 458:393–418, 2018. [4] Tarik Ben Zineb and Emmanuel Gobet. Preliminary control variates to improve empirical regression methods. Monte Carlo Methods Appl., 19(4):331–354, 2013. [5] Valentin De Bortoli and Alain Durmus. Convergence of diffusions and their discretizations: from continuous to discrete processes and back. arXiv preprint arXiv:1904.09808, 2019. [6] Nicolas Brosse, Alain Durmus, Sean Meyn, and Eric Moulines. Diffu- sion approximations and control variates for MCMC. arXiv preprint arXiv:1808.01665, 2018. [7] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017. [8] Petros Dellaportas and Ioannis Kontoyiannis. Control variates for estima- tion based on reversible Markov chain monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):133–161, 2012. D. Belomestny et al/Variance reduction for MCMC algorithms 33

[9] Ivan T Dimov. Monte Carlo methods for applied scientists. World Scientific, 2008. [10] R Douc, E Moulines, P Priouret, and P Soulier. Markov Chains. Springer New York, 2018. [11] A. Durmus and E.´ Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017. [12] Paul Glasserman. Monte Carlo methods in financial engineering, vol- ume 53. Springer Science & Business Media, 2013. [13] Emmanuel Gobet. Monte-Carlo methods and stochastic processes. CRC Press, Boca Raton, FL, 2016. From linear to non-linear. [14] L´aszl´o Gy¨orfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006. [15] Heikki Haario, Eero Saksman, and Johanna Tamminen. Adaptive proposal distribution for metropolis algorithm. Computational Statis- tics, 14(3):375–396, 1999. [16] Shane G Henderson. Variance reduction via an approximating Markov pro- cess. PhD thesis, Stanford University, 1997. [17] Shane G. Henderson and Burt Simon. Adaptive simulation using perfect control variates. J. Appl. Probab., 41(3):859–876, 09 2004. [18] Vincent Lemaire. An adaptive scheme for the approximation of dissipative systems. . Appl., 117(10):1491–1518, 2007. [19] K. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist., 24:101–121, 1996. [20] Antonietta Mira, Reza Solgi, and Daniele Imparato. Zero variance markov chain Monte Carlo for bayesian estimators. Statistics and Computing, 23(5):653–662, 2013. [21] Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for . Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages n/a–n/a, 2016. [22] Gilles Pag`esand Fabien Panloup. Ergodic approximation of the distribu- tion of a stationary diffusion: rate of convergence. Ann. Appl. Probab., 22(3):1059–1100, 2012. [23] Gilles Pag`esand Fabien Panloup. Weighted multilevel Langevin simulation of invariant measures. Ann. Appl. Probab., 28(6):3358–3417, 2018. [24] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the , volume 10. John Wiley & Sons, 2016. [25] L. F. South, C. J. Oates, A. Mira, and C. Drovandi. Regularised zero- variance control variates. arXiv preprint arXiv:1811.05073, 2018.