arXiv:1410.4329v1 [math.ST] 16 Oct 2014

Bernoulli 20(4), 2014, 1698–1716
DOI: 10.3150/13-BEJ537
© 2014 ISI/BS

Convergence rate and concentration inequalities for Gibbs sampling in high dimension

NENG-YI WANG1 and LIMING WU2

1 Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100190 Beijing, China. E-mail: [email protected]
2 Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100190 Beijing, China, and Laboratoire de Math. CNRS-UMR 6620, Université Blaise Pascal, 63177 Aubière, France. E-mail: [email protected]

The objective of this paper is to study the Gibbs sampling in very high dimension – a powerful Markov chain Monte Carlo method for computing the mean of an observable. Under the Dobrushin uniqueness condition, we establish some explicit and sharp estimate of the exponential convergence rate and prove some Gaussian concentration inequalities for the empirical mean.

Keywords: concentration inequality; coupling method; Dobrushin's uniqueness condition; Gibbs measure; Markov chain Monte Carlo

This is an electronic reprint of the original article published by the ISI/BS in Bernoulli, 2014, Vol. 20, No. 4, 1698–1716. This reprint differs from the original in pagination and typographic detail.

1. Introduction

Let µ be a Gibbs probability measure on E^N with dimension N very big, that is,

\[
\mu(\mathrm{d}x^1,\ldots,\mathrm{d}x^N)=\frac{e^{-V(x^1,\ldots,x^N)}\,\pi(\mathrm{d}x^1)\cdots\pi(\mathrm{d}x^N)}{\int_{E^N}e^{-V(x^1,\ldots,x^N)}\,\pi(\mathrm{d}x^1)\cdots\pi(\mathrm{d}x^N)},
\]

where π(dx^i) is some σ-finite reference measure on E. In fact, even in the simplest case where π is a probability measure (so that π(dx^1)···π(dx^N) is a product measure), the denominator contains an exponential number of terms, each of which may be very big or very small in high dimension, so it is very difficult to compute the mean of an observable under µ. Our purpose is to study the Gibbs sampling – a Markov chain Monte Carlo method (MCMC in short) for approximating µ. Let µ_i(·|x) = µ(dx^i | x^j, j ≠ i) be the regular conditional distribution of x^i knowing (x^j, j ≠ i), and let

\[
\bar\mu_i(\mathrm{d}y|x)=\mu_i(\mathrm{d}y^i|x)\prod_{j\neq i}\delta_{x^j}(\mathrm{d}y^j),
\]

where δ_· is the Dirac measure at the point ·. We see that

\[
\mu_i(\mathrm{d}x^i|x)=\frac{e^{-V(x^1,\ldots,x^N)}}{\int_E e^{-V(x^1,\ldots,x^N)}\,\pi(\mathrm{d}x^i)}\,\pi(\mathrm{d}x^i),
\]

which is a one-dimensional measure, easy to realize in practice. The idea of the Gibbs sampling consists in approximating µ via iterations of the one-dimensional conditional distributions µ_i, i = 1,...,N. It is described as follows. Given a starting configuration x_0 = (x_0^1,...,x_0^N) ∈ E^N, let (X_n; n ≥ 0) be a non-homogeneous Markov chain defined on some probability space (Ω, F, P_{x_0}), such that

X_0 = x_0 a.s., and given X_{kN+i-1} = x = (x^1,...,x^N) ∈ E^N (k ∈ ℕ, 1 ≤ i ≤ N), then X_{kN+i}^j = x^j for j ≠ i, and the conditional law of X_{kN+i}^i is µ_i(dy^i|x). In other words, the transition probability at step kN + i is:

\[
\mathbb P(X_{kN+i}\in\mathrm{d}y\,|\,X_{kN+i-1}=x)=\bar\mu_i(\mathrm{d}y|x).
\tag{1}
\]

Therefore,

\[
\mathbb P(X_{kN}\in\mathrm{d}y\,|\,X_{(k-1)N}=x)=(\bar\mu_1\cdots\bar\mu_N)(x,\mathrm{d}y)=:P(x,\mathrm{d}y).
\tag{2}
\]
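To make the sweep structure of (1)–(2) concrete, here is a minimal Python sketch of one application of the kernel P = µ̄_1···µ̄_N. The nearest-neighbour potential on E = {−1, +1} and the helper `conditional_prob_plus` are illustrative assumptions of ours, not a model taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_prob_plus(x, i, beta):
    # P(x^i = +1 | x^j, j != i) for the toy potential
    # V(x) = -beta * sum_i x^i x^{i+1} (free boundary): the conditional
    # distribution of site i depends only on its neighbours.
    field = 0.0
    if i > 0:
        field += x[i - 1]
    if i < len(x) - 1:
        field += x[i + 1]
    return 1.0 / (1.0 + np.exp(-2.0 * beta * field))

def gibbs_sweep(x, beta):
    # One application of P = mu_bar_1 ... mu_bar_N: coordinates are
    # refreshed one by one, in the fixed order i = 1, ..., N.
    x = x.copy()
    for i in range(len(x)):
        p = conditional_prob_plus(x, i, beta)
        x[i] = 1 if rng.random() < p else -1
    return x

# Z_k = X_{kN}: the state recorded after every complete sweep.
N, beta, n_sweeps = 100, 0.2, 50
x = rng.choice([-1, 1], size=N)
for k in range(n_sweeps):
    x = gibbs_sweep(x, beta)
```

Each call to `gibbs_sweep` corresponds to one step of the time-homogeneous chain (Z_k) introduced next.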

Finally, the Gibbs sampling is the time-homogeneous Markov chain (Z_k = X_{kN}, k = 0, 1,...), whose transition probability is P. This MCMC algorithm is sometimes known as the Gibbs sampler in the literature (see Winkler [31], Chapters 5 and 6). It is actively used in statistical physics, chemistry, biology and throughout Bayesian statistics (a sentence taken from [3]). It was used by Zegarlinski [34] as a tool for proving the logarithmic Sobolev inequality for Gibbs measures; see also the second named author [33] for a continuous time MCMC.

Our purpose is two-fold:

(1) the convergence rate of P^k to µ;
(2) the concentration inequality for P((1/n) Σ_{k=1}^n f(Z_k) − µ(f) ≥ t), t > 0.

Question (1) is a classic subject. Earlier works by Meyn and Tweedie [21] and Rosenthal [25, 26] are based on the Harris theorem (minorization condition together with the drift condition in the non-compact case). Quantitative estimates in the Harris ergodic theorem were obtained more recently by Rosenthal [27] and Hairer and Mattingly [11]. But as indicated by Diaconis, Khare and Saloff-Coste [2, 3], theoretical results obtained from the Harris theorem are very far (even too far) from the convergence rate of numerical simulations in high dimension (e.g., N = 100). That is why Diaconis, Khare and Saloff-Coste [2, 3] use new methods and tools (orthogonal polynomials, stochastic monotonicity and coupling) for obtaining sharp estimates of ‖νP^k − µ‖_TV (total variation norm) for several special models in Bayesian statistics, with E^N replaced by E × Θ, a space of two different components.

For question (1), our tool will be the Dobrushin interdependence coefficients (very natural and widely used in statistical physics), instead of the minorization condition in the Harris theorem or the special tools in [2, 3]. Our main idea consists in constructing an appropriate coupling well adapted to the Dobrushin interdependence coefficients, close to that of Marton [20]. For the second question, we will apply the recent theory of transport inequalities (see Marton [17], Ledoux [13, 14], Villani [30], Gozlan and Léonard [10] and references therein); our approach is inspired by Marton [18, 20] and by Djellout, Guillin and Wu [4] for dependent tensorization of transport inequalities. See [8, 24, 31] for Monte Carlo algorithms and diverse applications, and [12] for concentration inequalities of general MCMC under the positive curvature condition.

This paper is organized as follows. The main results are stated in the next section, and we prove them in Section 3.

2. Main results

Throughout the paper, E is a Polish space with the Borel σ-field B, and d is a metric on E such that d(x, y) is lower semi-continuous on E² (so d does not necessarily generate the topology of E). On the product space E^N we consider the L¹-metric

\[
d_{L^1}(x,y):=\sum_{i=1}^N d(x^i,y^i),\qquad x,y\in E^N.
\]

If d(x^i, y^i) = 1_{x^i≠y^i} is the discrete metric on E, d_{L^1} becomes the Hamming distance on E^N, a good metric for concentration in high dimension as shown by Marton [17, 18].

2.1. Dobrushin’s interdependence coefficient

Let M1(E) be the space of probability measures on E and

\[
M_1^d(E):=\Bigl\{\nu\in M_1(E);\ \int_E d(x_0,x)\,\nu(\mathrm{d}x)<\infty\Bigr\}
\]

(x_0 ∈ E is some fixed point). Given ν_1, ν_2 ∈ M_1^d(E), the L¹-Wasserstein distance between ν_1, ν_2 is given by

\[
W_{1,d}(\nu_1,\nu_2):=\inf_{\pi}\int\!\!\int_{E\times E} d(x,y)\,\pi(\mathrm{d}x,\mathrm{d}y),
\tag{2.1}
\]

where the infimum is taken over all probability measures π on E × E whose marginal distributions are, respectively, ν_1 and ν_2 (couplings of ν_1 and ν_2, say). When d(x, y) = 1_{x≠y} (the discrete metric), it is well known that

\[
W_{1,d}(\nu_1,\nu_2)=\sup_{A\in B}|\nu_1(A)-\nu_2(A)|=\tfrac12\|\nu_1-\nu_2\|_{\mathrm{TV}}\quad(\text{total variation}).
\]

Recall the Kantorovich–Rubinstein duality relation [30]

\[
W_{1,d}(\nu_1,\nu_2)=\sup_{\|f\|_{\mathrm{Lip}}\le1}\int_E f\,\mathrm{d}(\nu_1-\nu_2),\qquad
\|f\|_{\mathrm{Lip}}:=\sup_{x\neq y}\frac{|f(x)-f(y)|}{d(x,y)}.
\]

Let µ_i(dx^i|x) be the given regular conditional distribution of x^i knowing (x^j, j ≠ i). Throughout the paper, we assume that ∫_{E^N} d(y^i, x_0^i) dµ(y) < ∞ and µ_i(·|x) ∈ M_1^d(E) for all i = 1,...,N and x ∈ E^N, where x_0 is some fixed point of E^N, and that x → µ_i(·|x) is Lipschitzian from (E^N, d_{L^1}) to (M_1^d(E), W_{1,d}). Define the matrix of the d-Dobrushin interdependence coefficients C := (c_{ij})_{i,j=1,...,N} by

\[
c_{ij}:=\sup_{x=y\ \text{off}\ j}\frac{W_{1,d}(\mu_i(\cdot|x),\mu_i(\cdot|y))}{d(x^j,y^j)},\qquad i,j=1,\ldots,N.
\tag{2.2}
\]

Obviously c_{ii} = 0. Then the well-known Dobrushin uniqueness condition (see [5, 6]) reads as

\[
\text{(H1)}\qquad r:=\|C\|_\infty:=\max_{1\le i\le N}\sum_{j=1}^N c_{ij}<1
\]

or

\[
\text{(H2)}\qquad r_1:=\|C\|_1:=\max_{1\le j\le N}\sum_{i=1}^N c_{ij}<1.
\]

By the triangle inequality for the metric W_{1,d},

\[
W_{1,d}(\mu_i(\cdot|x),\mu_i(\cdot|y))\le\sum_{j=1}^N c_{ij}\,d(x^j,y^j),\qquad 1\le i\le N.
\tag{2.3}
\]

2.2. Transport inequality and Bobkov–Götze's criterion

When µ, ν are probability measures, the Kullback information (or relative entropy) of ν with respect to µ is defined as

\[
H(\nu|\mu)=
\begin{cases}
\displaystyle\int\log\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\,\mathrm{d}\nu, & \text{if }\nu\ll\mu,\\
+\infty, & \text{otherwise.}
\end{cases}
\tag{2.4}
\]

We say that the probability measure µ satisfies the L¹-transport-entropy inequality on (E, d) with some constant C > 0, if

\[
W_{1,d}(\mu,\nu)\le\sqrt{2CH(\nu|\mu)},\qquad \nu\in M_1(E).
\tag{2.5}
\]

For short, we write µ ∈ T_1(C) for this relation. This inequality, related to the phenomenon of measure concentration, was introduced and studied by Marton [17, 18], developed subsequently by Talagrand [29], Bobkov and Götze [1], Djellout, Guillin and Wu [4], and amply explored by Ledoux [13, 14], Villani [30] and Gozlan and Léonard [10]. Let us mention the following Bobkov–Götze criterion.

Lemma 2.1 ([1]). A probability measure µ satisfies the L1-transport-entropy inequality on (E, d) with constant C > 0, that is, µ ∈ T1(C), if and only if for any Lipschitzian function F : (E, d) → R, F is µ-integrable and

\[
\int_E e^{\lambda(F-\langle F\rangle_\mu)}\,\mathrm{d}\mu\le\exp\Bigl(\frac{\lambda^2}{2}C\|F\|_{\mathrm{Lip}}^2\Bigr)\qquad\forall\lambda\in\mathbb R,
\]

where ⟨F⟩_µ = µ(F) = ∫_E F dµ. In that case,

\[
\mu(F-\langle F\rangle_\mu\ge t)\le\exp\Bigl(-\frac{t^2}{2C\|F\|_{\mathrm{Lip}}^2}\Bigr),\qquad t>0.
\]
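The second bound is the standard Chernoff consequence of the first; we spell out this routine optimization here because the same step (Chebyshev's inequality followed by optimization over λ) is invoked again in the proof of Theorem 2.7:

\[
\mu(F-\langle F\rangle_\mu\ge t)\le\inf_{\lambda>0}e^{-\lambda t}\int_E e^{\lambda(F-\langle F\rangle_\mu)}\,\mathrm{d}\mu
\le\inf_{\lambda>0}\exp\Bigl(\frac{\lambda^2}{2}C\|F\|_{\mathrm{Lip}}^2-\lambda t\Bigr)
=\exp\Bigl(-\frac{t^2}{2C\|F\|_{\mathrm{Lip}}^2}\Bigr),
\]

the infimum being attained at λ = t/(C‖F‖²_Lip).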

Another necessary and sufficient condition for T_1(C) is the Gaussian integrability of µ; see Djellout, Guillin and Wu [4]. For further results and recent progress see Gozlan and Léonard [9, 10].

Remark 2.2. Recall also that w.r.t. the discrete metric d(x, y) = 1_{x≠y}, any probability measure µ satisfies T_1(C) with the sharp constant C = 1/4 (the well-known CKP inequality).
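Unpacking the constant (our own short remark, combining (2.5) for C = 1/4 with the identity W_{1,d} = ‖·‖_TV/2 recalled above), T_1(1/4) for the discrete metric is exactly the Pinsker-type (CKP) inequality:

\[
W_{1,d}(\mu,\nu)=\tfrac12\|\mu-\nu\|_{\mathrm{TV}}\le\sqrt{2\cdot\tfrac14\,H(\nu|\mu)}
\quad\Longleftrightarrow\quad
\|\mu-\nu\|_{\mathrm{TV}}\le\sqrt{2H(\nu|\mu)}.
\]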

2.3. Main results

For any function f : E^N → ℝ, let

\[
\delta_i(f):=\sup_{x=y\ \text{off}\ i}\frac{|f(x)-f(y)|}{d(x^i,y^i)}
\]

be the Lipschitzian coefficient w.r.t. the ith coordinate x^i. It is easy to see that

\[
\|f\|_{\mathrm{Lip}(d_{L^1})}=\max_{1\le i\le N}\delta_i(f).
\]
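As a simple illustration of these quantities (the same computation reappears in Remark 2.8 below), for an empirical-mean observable built from a d-Lipschitzian g : E → ℝ,

\[
f(x)=\frac1N\sum_{i=1}^N g(x^i)\quad\Longrightarrow\quad
\delta_i(f)=\frac{\|g\|_{\mathrm{Lip}}}{N}\ \text{for every }i,\qquad
\|f\|_{\mathrm{Lip}(d_{L^1})}=\frac{\|g\|_{\mathrm{Lip}}}{N}.
\]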

Theorem 2.3 (Convergence rate). Under the Dobrushin uniqueness condition (H1), we have:

(a) For any Lipschitzian function f on E^N and two initial distributions ν_1, ν_2 on E^N,

\[
|\nu_1P^kf-\nu_2P^kf|\le r^k\max_{1\le i\le N}\mathbb E\,d(Z_0^i(1),Z_0^i(2))\sum_{i=1}^N\delta_i(f),\qquad k\ge1,
\tag{2.6}
\]

where (Z_0(1), Z_0(2)) is a coupling of (ν_1, ν_2), that is, the law of Z_0(j) is ν_j for j = 1, 2.

(b) In particular for any initial distribution ν on EN ,

\[
W_{1,d_{L^1}}(\nu P^k,\mu)\le Nr^k\max_{1\le i\le N}\mathbb E\,d(Z_0^i(1),Z_0^i(2)),
\]

where (Z0(1),Z0(2)) is a coupling of (ν,µ).

By part (b) above, µ is the unique invariant measure of P under the Dobrushin uniqueness condition, and P^k(x, ·) converges exponentially rapidly to µ in the metric W_{1,d_{L^1}}, showing theoretically why the numerical simulations by the Gibbs sampling are very rapid.

Remark 2.4. Let us compare Theorem 2.3 with the known results in [2, 3, 21, 25, 26] on the convergence rate of the Gibbs sampling. First, the convergence rate in those known works is in the total variation norm, not in the metric W_{1,d_{L^1}}. When d is the discrete metric, we have by part (b) of Theorem 2.3

\[
\tfrac12\|\nu P^k-\mu\|_{\mathrm{TV}}\le W_{1,d_{L^1}}(\nu P^k,\mu)\le Nr^k\qquad\forall\nu,\ \forall k\ge1.
\]

Next, let us explain once again why the minorization condition in the Harris theorem does not yield accurate estimates in high dimension (see Diaconis et al. [2, 3] for similar discussions based on concrete examples). Indeed, assume that E is finite; then under reasonable assumptions on V(x^1,...,x^N), there exist a constant c > 0 and a probability measure ν(dy) such that P(x, dy) ≥ e^{−cN} ν(dy) (i.e., almost the best minorization that one can obtain in the dependent case). Hence by the Doeblin theorem (the ancestor of the Harris theorem),

\[
\sup_{x\in E^N}\tfrac12\|P^k(x,\cdot)-\mu\|_{\mathrm{TV}}\le(1-e^{-cN})^k.
\]

So one requires at least an exponential number e^{cN} of steps for the right-hand side to become small. Our estimate of the convergence rate is much better in high dimension; that is the good point of Theorem 2.3. The weak point of Theorem 2.3 is that our result depends on the Dobrushin uniqueness condition, even in low dimension. If N is small, the results in [21, 25, 26] are already good enough. In particular, the estimates of Diaconis, Khare and Saloff-Coste [2, 3] for the special space of two different components in Bayesian statistics are sharp. We should indicate that the Dobrushin uniqueness condition is quite natural for the exponential convergence of P^k to µ with the rate r independent of N as in this theorem, since the Dobrushin uniqueness condition is well known to be sharp for the phase transition of mean field models [5–7]. Finally, our tool (Dobrushin's interdependence coefficients) is completely different from those in the known works.
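To see the two bounds side by side, here is a rough worked comparison with illustrative values of ours (not taken from the paper): N = 100, r = 1/2 and target accuracy ε = 0.01 in total variation. The bound of Remark 2.4 gives ‖νP^k − µ‖_TV ≤ 2Nr^k ≤ ε as soon as

\[
k\ \ge\ \frac{\log(2N/\varepsilon)}{\log(1/r)}=\frac{\log 20000}{\log 2}\approx14.3,
\]

so about 15 Gibbs sweeps suffice, whereas the Doeblin bound (1 − e^{−cN})^k ≤ ε requires roughly k ≳ e^{cN} log(1/ε) steps, already astronomically large for moderate c.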

Remark 2.5. As indicated by a referee, it would be very interesting to investigate the convergence rate problem under the more flexible Dobrushin–Shlosman analyticity condition (i.e., the box version of the Dobrushin uniqueness condition, reference [7]), but in that case we feel that we should change the algorithm: instead of µ_i, one uses the conditional distribution µ_I(dx^I|x) of x^I knowing (x^j, j ∉ I), where I is a box containing i.

Remark 2.6. A much more classical topic is Glauber dynamics associated with Gibbs measures in finite or infinite volume. We are content here to mention only Zegarlinski [34], Martinelli and Olivieri [16], and the lecture notes of Martinelli [15] for a great number of references.

The convergence rate estimate above will be our starting point for computing the mean µ(f), that is, for approximating µ(f) by the empirical mean (1/n) Σ_{k=1}^n f(Z_k).

Theorem 2.7. Assume

\[
r_1:=\|C\|_1=\max_{1\le j\le N}\sum_{i=1}^N c_{ij}<\frac12
\]

and for some constant C_1 > 0,

\[
\text{(H3)}\qquad \forall i=1,\ldots,N,\ \forall x\in E^N:\quad \mu_i(\cdot|x)\in T_1(C_1).
\]

(Recall that C_1 = 1/4 for the discrete metric d(x, y) = 1_{x≠y}.) Then for any Lipschitzian function f on E^N with ‖f‖_{Lip(d_{L^1})} ≤ α, we have:

(a)

\[
\mathbb P_x\Biggl(\frac1n\sum_{k=1}^n f(Z_k)-\frac1n\sum_{k=1}^n P^kf(x)\ge t\Biggr)\le\exp\Bigl(-\frac{t^2(1-2r_1)^2n}{2C_1\alpha^2N}\Bigr)\qquad\forall t>0,\ n\ge1;
\tag{2.7}
\]

(b) furthermore, if (H1) holds,

\[
\mathbb P_x\Biggl(\frac1n\sum_{k=1}^n f(Z_k)-\mu(f)\ge\frac Mn+t\Biggr)\le\exp\Bigl(-\frac{t^2(1-2r_1)^2n}{2C_1\alpha^2N}\Bigr)\qquad\forall t>0,\ n\ge1,
\tag{2.8}
\]

where

\[
M=\frac{r}{1-r}\max_{1\le i\le N}\int_{E^N}d(x^i,y^i)\,\mathrm{d}\mu(y)\cdot\sum_{i=1}^N\delta_i(f).
\]

In conclusion, under the conditions of this theorem, when n ≫ N, the empirical mean (1/n) Σ_{k=1}^n f(Z_k) approximates µ(f) exponentially rapidly in probability with speed n/N, with a bias not greater than M/n. The speed n/N is the correct one, as will be shown in the remark below. We do not know whether the concentration inequality with speed n/N still holds under the more natural Dobrushin uniqueness condition r_1 = ‖C‖_1 < 1. We know only that r_1 < 1 does not imply that P is contracting in the metric W_{1,d_{L^1}}; see the example in Remark 3.3.
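As a hedged illustration of how (2.8) is used in practice (the numerical values below are ours, not the paper's), the number of sweeps needed so that the probability in (2.8) is at most δ is

\[
n\ \ge\ \frac{2C_1\alpha^2N\log(1/\delta)}{t^2(1-2r_1)^2};
\]

for the discrete metric (C_1 = 1/4) with r_1 = 1/4, α = 1, t = 0.1 and δ = 10^{−3}, this gives n ≳ 1382 N, that is, a number of sweeps proportional to N, in line with the speed n/N.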

Remark 2.8. Consider f(x) = (1/N) Σ_{i=1}^N g(x^i), where g : E → ℝ is d-Lipschitzian with ‖g‖_Lip = α (observables of this type are often used in statistical mechanics). Since ‖f‖_{Lip(d_{L^1})} = α/N, the inequality (2.7) implies for all t > 0, n ≥ 1,

\[
\mathbb P_x\Biggl(\frac1{nN}\sum_{k=1}^n\sum_{i=1}^N g(Z_k^i)-\mathbb E_x\frac1{nN}\sum_{k=1}^n\sum_{i=1}^N g(Z_k^i)\ge t\Biggr)\le\exp\Bigl(-\frac{t^2(1-2r_1)^2nN}{2C_1\alpha^2}\Bigr),
\tag{2.9}
\]

which is of speed nN. Let us show that the concentration inequality (2.7) is sharp. Consider the free case, that is, µ_i(dx^i|x) = β(dx^i) does not depend upon (x^j)_{j≠i} and i, and µ is the product measure β^{⊗N}. In this case P(x, dy) = µ(dy); in other words, (Z_k)_{k≥1} is a sequence of independent and identically distributed (i.i.d. in short) random variables valued in E^N, of common law µ. Since r_1 = 0 in the free case, the concentration inequality (2.9) is equivalent to the transport inequality (H3) for µ_i = β, by Gozlan and Léonard [9]. That shows also that the speed n/N in Theorem 2.7 is the correct one.

Remark 2.9. We explain now why we do not apply directly the nice concentration results of Joulin and Ollivier [12] for general MCMC. In fact under the condition that r1 < 1/2, we can prove that

\[
W_{1,d_{L^1}}(P(x,\cdot),P(y,\cdot))\le\frac{r_1}{1-r_1}\,d_{L^1}(x,y)
\]

(by Lemma 3.2). In other words, the Ricci curvature in [12] is bounded from below by

\[
\kappa:=1-\sup_{x\neq y}\frac{W_{1,d_{L^1}}(P(x,\cdot),P(y,\cdot))}{d_{L^1}(x,y)}\ \ge\ 1-\frac{r_1}{1-r_1}=\frac{1-2r_1}{1-r_1}.
\]

Unfortunately we cannot show that the Ricci curvature κ is positive in the case where r_1 ∈ [1/2, 1). If (E, d) is unbounded, the results of [12], Theorems 4 and 5, do not apply here, because their granularity constant

\[
\sigma_\infty=\frac12\sup_{x\in E^N}\mathrm{Diam}(\mathrm{supp}(P(x,\cdot)))
\]

explodes.

Assume now that (E, d) is bounded. If we apply the results ([12], Theorems 4 and 5) and their notations, their coarse diffusion constant

\[
\sigma(x)^2=\frac12\int\!\!\int d_{L^1}(y,z)^2\,P(x,\mathrm{d}y)P(x,\mathrm{d}z)
\]

is of order N²; and their local dimension

\[
n_x=\inf_{f:\|f\|_{\mathrm{Lip}(d_{L^1})}=1}\frac{\sigma(x)^2}{\mathrm{Var}_{P(x,\cdot)}(f)}
\]

is of order N (by Lemma 3.4 below), and their granularity constant σ_∞ is of order N. Setting

\[
V^2=\frac1{\kappa n}\sup_{x\in E^N}\frac{\sigma(x)^2}{n_x\kappa};\qquad r_{\max}=\frac{4V^2\kappa n}{3\sigma_\infty},
\]

Theorem 4 in [12] says that if ‖f‖_{Lip(d_{L^1})} = 1,

\[
\mathbb P_x\Biggl(\frac1n\sum_{k=1}^n f(Z_k)-\frac1n\sum_{k=1}^n P^kf(x)\ge t\Biggr)\le
\begin{cases}
2\exp\Bigl(-\dfrac{t^2}{16V^2}\Bigr), & t\in(0,r_{\max}),\\[8pt]
2\exp\Bigl(-\dfrac{\kappa nt}{12\sigma_\infty}\Bigr), & t\ge r_{\max},
\end{cases}
\]

for all n ≥ 1 and t > 0. So for small deviations t, their result yields a Gaussian concentration inequality of the same order, but for large deviations t, their estimate is only exponential, not Gaussian as one may expect in this bounded case. In [12], Theorem 5, they get a Gaussian-exponential concentration inequality of the same type, with V², r_max depending upon the starting point x. Anyway, the key lemmas in this paper are necessary for applying the results of [12] to this particular model.

Remark 2.10. For the Gibbs measure µ on ℝ^N, Marton [20] established the Talagrand transport inequality T_2 on ℝ^N equipped with the Euclidean metric, under the Dobrushin–Shlosman analyticity type condition. The second named author [33] proved T_1(C) for µ on E^N equipped with the metric d_{L^1}, under (H1). But those transport inequalities are for the equilibrium distribution µ, not for the Gibbs sampling, which is a Markov chain with µ as invariant measure. However, our coupling is very close to that of K. Marton.

Remark 2.11. For φ-mixing sequences of dependent random variables, Rio [23] and Samson [28] established accurate concentration inequalities; see also Djellout, Guillin and Wu [4] and the recent works by Paulin [22] and Wintenberger [32] for generalizations and improvements. In the Markov chain case, φ-mixing means the Doeblin uniform ergodicity. If one applies the results in [23, 28] to the Gibbs sampling, one obtains concentration inequalities with the speed n/(NS²), where

\[
S=\sup_{x\in E^N}\sum_{k=0}^\infty\|P^k(x,\cdot)-\mu\|_{\mathrm{TV}}.
\]

When (H1) holds with the discrete metric d, S is actually finite but it is of order N by Theorem 2.3 (and its remarks). The concentration inequalities so obtained from [23, 28] are of speed n/N³, very far from the correct speed n/N.

Remark 2.12. When f depends on a very small number of variables, since ‖f‖_{Lip(d_{L^1})} = max_i δ_i(f) does not reflect the nature of such an observable, one can imagine that our concentration inequalities do not yield the correct speed. In fact, in the free case and for f(x) = g(x^1), the correct speed must be n, not n/N. For this type of observable, one may use the metric d_{L^2}, which reflects much better the number of variables in such an observable. The ideas in Marton [19, 20] should be helpful. That will be another story.

3. Proofs of the main results

3.1. The construction of the coupling

Given any two initial distributions ν_1 and ν_2 on E^N, we begin by constructing our coupled non-homogeneous Markov chain (X_i, Y_i)_{i≥0}, which is quite close to the coupling by Marton [20]. Let (X_0, Y_0) be a coupling of (ν_1, ν_2). And given

\[
(X_{kN+i-1},Y_{kN+i-1})=(x,y)\in E^N\times E^N,\qquad k\in\mathbb N,\ 1\le i\le N,
\]

then

\[
X_{kN+i}^j=x^j,\quad Y_{kN+i}^j=y^j,\qquad j\neq i,
\]

and

\[
\mathbb P((X_{kN+i}^i,Y_{kN+i}^i)\in\cdot\,|\,(X_{kN+i-1},Y_{kN+i-1})=(x,y))=\pi(\cdot|x,y),
\]

where π(·|x, y) is an optimal coupling of µ_i(·|x) and µ_i(·|y) such that

\[
\int\!\!\int_{E^2}d(\tilde x,\tilde y)\,\pi(\mathrm{d}\tilde x,\mathrm{d}\tilde y|x,y)=W_{1,d}(\mu_i(\cdot|x),\mu_i(\cdot|y)).
\]

Define the partial order on ℝ^N by a ≤ b if and only if a^i ≤ b^i, i = 1,...,N. Then, by (2.3), we have for all k ∈ ℕ, 1 ≤ i ≤ N,

\[
\begin{pmatrix}
\mathbb E[d(X_{kN+i}^1,Y_{kN+i}^1)\,|\,X_{kN+i-1},Y_{kN+i-1}]\\
\vdots\\
\mathbb E[d(X_{kN+i}^N,Y_{kN+i}^N)\,|\,X_{kN+i-1},Y_{kN+i-1}]
\end{pmatrix}
\le B_i
\begin{pmatrix}
d(X_{kN+i-1}^1,Y_{kN+i-1}^1)\\
\vdots\\
d(X_{kN+i-1}^N,Y_{kN+i-1}^N)
\end{pmatrix},
\]

where B_i is the identity matrix with its ith row replaced by (c_{i1}, c_{i2}, ..., c_{iN}):

\[
B_i=\begin{pmatrix}
1 & & & & \\
 & \ddots & & & \\
c_{i1} & c_{i2} & \cdots & \cdots & c_{iN}\\
 & & & \ddots & \\
 & & & & 1
\end{pmatrix}
\]

(the blanks in the matrix mean 0).

\[
\begin{pmatrix}
\mathbb E\,d(X_N^1,Y_N^1)\\
\vdots\\
\mathbb E\,d(X_N^N,Y_N^N)
\end{pmatrix}
\le B_NB_{N-1}\cdots B_1
\begin{pmatrix}
\mathbb E\,d(X_0^1,Y_0^1)\\
\vdots\\
\mathbb E\,d(X_0^N,Y_0^N)
\end{pmatrix}.
\tag{3.1}
\]

\[
Q:=B_NB_{N-1}\cdots B_1.
\tag{3.2}
\]

Then we have the following lemma.

Lemma 3.1. Under (H1), ‖Q‖_∞ := max_{1≤i≤N} Σ_{j=1}^N Q_{ij} ≤ r.

Proof. We use the probabilistic method. Under (H1) we can construct a Markov chain {ξ_0,...,ξ_N}, taking values in {1,...,N} ⊔ {∆}, where ∆ is an extra point representing the cemetery, written as follows:

\[
\xi_0\ \xrightarrow{B_N}\ \xi_1\ \xrightarrow{B_{N-1}}\ \cdots\ \xrightarrow{B_1}\ \xi_N,
\]

where the transition matrix from ξ_i to ξ_{i+1} is B_{N−i}; more precisely, for every k = 1,...,N,

\[
\begin{aligned}
&\mathbb P(\xi_k=j\,|\,\xi_{k-1}=N-(k-1))=c_{N-(k-1),j},\\
&\mathbb P(\xi_k=\Delta\,|\,\xi_{k-1}=N-(k-1))=1-\sum_{j=1}^N c_{N-(k-1),j},\\
&\mathbb P(\xi_k=j\,|\,\xi_{k-1}=i)=\delta_{ij},\qquad i\neq N-(k-1),\\
&\mathbb P(\xi_k=\Delta\,|\,\xi_{k-1}=\Delta)=1.
\end{aligned}
\]

Here δij = 1 if i = j and 0 otherwise (Kronecker’s symbol). Then

\[
\|Q\|_\infty=\max_{1\le i\le N}\mathbb P(\xi_N\neq\Delta\,|\,\xi_0=i).
\]

For any i =1,...,N, when ξ0 = i, we have ξ0 = ξ1 = ··· = ξN−i = i. Therefore,

\[
\mathbb P(\xi_{N-i+1}\neq\Delta\,|\,\xi_0=i)=\sum_{j=1}^N c_{ij}\le r
\]

and thus

\[
\mathbb P(\xi_N\neq\Delta\,|\,\xi_0=i)\le\mathbb P(\xi_{N-1}\neq\Delta\,|\,\xi_0=i)\le\cdots\le\mathbb P(\xi_{N-i+1}\neq\Delta\,|\,\xi_0=i)\le r.
\]

So ‖Q‖_∞ ≤ r. □
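Lemma 3.1 can also be checked numerically. The following sketch (our own illustration, with a randomly generated coefficient matrix satisfying (H1)) builds Q = B_N ··· B_1 from C and compares ‖Q‖_∞ with r = ‖C‖_∞.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8

# Random nonnegative matrix C with zero diagonal and row sums equal to 0.9,
# so that the Dobrushin condition (H1) holds with r = 0.9.
C = rng.random((N, N))
np.fill_diagonal(C, 0.0)
C *= 0.9 / C.sum(axis=1, keepdims=True)

def B(i):
    # Identity matrix with the ith row replaced by the ith row of C.
    M = np.eye(N)
    M[i, :] = C[i, :]
    return M

# Q = B_N B_{N-1} ... B_1 (rows/columns indexed 0, ..., N-1 here).
Q = np.eye(N)
for i in range(N):
    Q = B(i) @ Q          # left-multiply, so Q ends up as B(N-1) ... B(0)

r = np.max(C.sum(axis=1))            # r = ||C||_inf (max row sum)
norm_inf_Q = np.max(Q.sum(axis=1))   # ||Q||_inf
assert norm_inf_Q <= r + 1e-12       # the conclusion of Lemma 3.1
print(norm_inf_Q, r)
```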

3.2. Proof of Theorem 2.3

By (3.1) above and iterations,

\[
\begin{pmatrix}
\mathbb E\,d(X_{kN}^1,Y_{kN}^1)\\
\vdots\\
\mathbb E\,d(X_{kN}^N,Y_{kN}^N)
\end{pmatrix}
\le Q^k
\begin{pmatrix}
\mathbb E\,d(X_0^1,Y_0^1)\\
\vdots\\
\mathbb E\,d(X_0^N,Y_0^N)
\end{pmatrix}.
\tag{3.3}
\]

Let Z_k(1) = X_{kN}, Z_k(2) = Y_{kN}, k ≥ 0; then by Lemma 3.1,

\[
\max_{1\le i\le N}\mathbb E\,d(Z_k^i(1),Z_k^i(2))\le r^k\max_{1\le i\le N}\mathbb E\,d(Z_0^i(1),Z_0^i(2)).
\tag{3.4}
\]

Now the results of this theorem follow quite easily from this inequality. In fact:

(a) For any Lipschitzian function f : E^N → ℝ,

\[
\begin{aligned}
|\nu_1P^kf-\nu_2P^kf|&=|\mathbb Ef(Z_k(1))-\mathbb Ef(Z_k(2))|\\
&\le\sum_{i=1}^N\delta_i(f)\,\mathbb E\,d(Z_k^i(1),Z_k^i(2))\\
&\le r^k\max_{1\le i\le N}\mathbb E\,d(Z_0^i(1),Z_0^i(2))\sum_{i=1}^N\delta_i(f),
\end{aligned}
\]

where the last inequality follows by (3.4). That is (2.6).

(b) Now for ν_1 = ν, ν_2 = µ, as µP = µ, we have

\[
\begin{aligned}
W_{1,d_{L^1}}(\nu P^k,\mu)=W_{1,d_{L^1}}(\nu P^k,\mu P^k)&\le\sum_{i=1}^N\mathbb E\,d(Z_k^i(1),Z_k^i(2))\\
&\le N\max_{1\le i\le N}\mathbb E\,d(Z_k^i(1),Z_k^i(2))\\
&\le Nr^k\max_{1\le i\le N}\mathbb E\,d(Z_0^i(1),Z_0^i(2)),
\end{aligned}
\]

the desired result. □

3.3. Proof of Theorem 2.7

We begin with

Lemma 3.2. If r_1 := ‖C‖_1 < 1 (i.e., (H2)), then for the matrix Q given in (3.2),

\[
\|Q\|_1:=\max_{1\le j\le N}\sum_{k=1}^N Q_{kj}\le\frac{r_1}{1-r_1}.
\tag{3.5}
\]

In particular,

\[
W_{1,d_{L^1}}(P(x,\cdot),P(y,\cdot))\le\frac{r_1}{1-r_1}\,d_{L^1}(x,y)\qquad\forall x,y\in E^N.
\tag{3.6}
\]

Proof. The last conclusion (3.6) follows from (3.5) and (3.1). We show now (3.5). By the definition of Q = B_N ··· B_1, it is not difficult to verify that for 1 ≤ k ≤ N,

\[
Q_{kj}=
\begin{cases}
0, & \text{if }j=1,\\[6pt]
\displaystyle\sum_{h=1}^{j-1}\Biggl(\sum_{l=1}^{k-1}\ \sum_{k>i_l>\cdots>i_2>i_1=h}c_{k,i_l}c_{i_l,i_{l-1}}\cdots c_{i_2,i_1=h}\,c_{h,j}+c_{h,j}\mathbf 1_{h=k}\Biggr), & \text{if }2\le j\le N.
\end{cases}
\tag{3.7}
\]

Here we make the convention Σ_∅ · = 0. This can be obtained again by the Markov chain (ξ_0, ξ_1,...,ξ_N) valued in {1,...,N} ∪ {∆} constructed in Lemma 3.1. Since P(ξ_N = 1 | ξ_{N−1} = i) = 0 for all i, Q_{k1} = P(ξ_N = 1 | ξ_0 = k) = 0: that is the first line in the expression of Q. Now for j = 2,...,N, as

Q_{kj} = P(ξ_N = j | ξ_0 = k) = P(ξ_N = j | ξ_{N−k} = k), and if ξ_{N−k} = k and ξ_{N−k+1} > k, then ξ_{N−k+1} = ··· = ξ_N, and so

\[
\mathbb P(\xi_N=j\,|\,\xi_{N-k}=k)=c_{kj}\mathbf 1_{k<j}+\sum_{h=1}^{k-1}c_{k,h}\,\mathbb P(\xi_N=j\,|\,\xi_{N-h}=h),
\]

which, iterated, gives the second line of (3.7). Therefore,

\[
\begin{aligned}
\sum_{k=1}^N Q_{kj}&=\sum_{k=1}^N\sum_{h=1}^{j-1}\Biggl(\sum_{l=1}^{k-1}\ \sum_{k>i_l>\cdots>i_2>i_1=h}c_{k,i_l}c_{i_l,i_{l-1}}\cdots c_{i_2,i_1=h}\,c_{h,j}+c_{h,j}\mathbf 1_{h=k}\Biggr)\\
&=\sum_{k=1}^N\sum_{h=1}^{j-1}c_{hj}\mathbf 1_{h=k}+\sum_{k=1}^N\sum_{h=1}^{j-1}\sum_{l=1}^{k-1}\ \sum_{k>i_l>\cdots>i_2>i_1=h}c_{k,i_l}c_{i_l,i_{l-1}}\cdots c_{i_2,i_1=h}\,c_{h,j}\\
&=\sum_{h=1}^{j-1}c_{hj}+\sum_{h=1}^{j-1}\sum_{k=1}^{N}\sum_{l=1}^{k-1}\ \sum_{k>i_l>\cdots>i_2>i_1=h}c_{k,i_l}c_{i_l,i_{l-1}}\cdots c_{i_2,i_1=h}\,c_{h,j}\\
&\le r_1+\sum_{l=1}^{N-1}\sum_{h=1}^{j-1}\sum_{k=l+1}^{N}\ \sum_{k>i_l>\cdots>i_2>i_1=h}c_{k,i_l}c_{i_l,i_{l-1}}\cdots c_{i_2,i_1=h}\,c_{h,j}\\
&\le r_1+r_1^2+\cdots+r_1^N,
\end{aligned}
\]

where the last inequality holds because for fixed l: 1 ≤ l ≤ N − 1 and h: 1 ≤ h ≤ j − 1,

\[
\sum_{k=1}^N\ \sum_{k>i_l>\cdots>i_2>i_1=h}c_{k,i_l}c_{i_l,i_{l-1}}\cdots c_{i_2,i_1=h}\le\sum_{k=1}^N(C^l)_{kh}\le\|C^l\|_1\le r_1^l.
\]

So the proof of (3.5) is completed. □

Remark 3.3. Let µ be the Gaussian distribution on ℝ² with mean 0 and covariance matrix \(\bigl(\begin{smallmatrix}1&r\\ r&1\end{smallmatrix}\bigr)\), where r ∈ (0, 1). We have c_{12} = c_{21} = r < 1 (i.e., (H1) and (H2) both hold); and under P((x_1, x_2), (dy_1, dy_2)), z_1 = y_1 − rx_2 and z_2 = y_2 − ry_1 are i.i.d. Gaussian random variables with mean 0 and variance 1 − r². Hence,

\[
Q=B_2B_1=\begin{pmatrix}1&0\\ r&0\end{pmatrix}\begin{pmatrix}0&r\\ 0&1\end{pmatrix}=\begin{pmatrix}0&r\\ 0&r^2\end{pmatrix},
\]

and since y_1 = rx_2 + z_1, y_2 = z_2 + rz_1 + r²x_2,

\[
W_{1,d_{L^1}}[P((x_1,x_2),\cdot),P((x_1',x_2'),\cdot)]=(r+r^2)|x_2-x_2'|.
\]

Thus, ‖Q‖_1 = r + r², and the Ricci curvature κ is positive if and only if r + r² < 1. In other words, though we have discarded many terms in the proof above, the estimate of ‖Q‖_1 cannot be qualitatively improved.
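A quick numerical check of this two-dimensional example (our own sketch; it only re-verifies the matrices displayed above and the bound of Lemma 3.2):

```python
import numpy as np

r = 0.3
C = np.array([[0.0, r],
              [r, 0.0]])

B1 = np.eye(2); B1[0, :] = C[0, :]   # row 1 replaced by (c_11, c_12) = (0, r)
B2 = np.eye(2); B2[1, :] = C[1, :]   # row 2 replaced by (c_21, c_22) = (r, 0)

Q = B2 @ B1
assert np.allclose(Q, [[0.0, r], [0.0, r**2]])   # Q as in Remark 3.3

norm1_Q = np.max(Q.sum(axis=0))                  # ||Q||_1 = r + r**2
print(norm1_Q, r + r**2, r / (1 - r))            # exact value vs the bound (3.5)
```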

Lemma 3.4. Assume (H2) and (H3), then

\[
P(x_0,\cdot)\in T_1\Bigl(\frac{NC_1}{(1-r_1)^2}\Bigr)\qquad\forall x_0=(x_0^1,\ldots,x_0^N)\in E^N.
\]

Proof. The proof is similar to the one used by Djellout, Guillin and Wu [4], Theorem 2.5. First, for simplicity, denote P(x_0, ·) by P and note that for 1 ≤ i ≤ N,

\[
X_N^1=X_1^1,\ \ldots,\ X_N^i=X_i^i,\qquad
\mathbb P(X_N^i\in\cdot\,|\,X_N^1,\ldots,X_N^{i-1})=\mu_i(\cdot\,|\,X_N^1,\ldots,X_N^{i-1},x_0^{i+1},\ldots,x_0^N),
\]

and thus

\[
W_{1,d}\bigl(P(X_N^i\in\cdot\,|\,X_N^1=x^1,\ldots,X_N^{i-1}=x^{i-1}),\ P(X_N^i\in\cdot\,|\,X_N^1=y^1,\ldots,X_N^{i-1}=y^{i-1})\bigr)\le\sum_{j=1}^{i-1}c_{ij}\,d(x^j,y^j).
\]

For any probability measure Q on E^N such that H(Q|P) < ∞, let Q_i(·|x^{[1,i−1]}) be the regular conditional law of x^i knowing x^{[1,i−1]}, where i ≥ 2 and x^{[1,i−1]} = (x^1,...,x^{i−1}), and Q_i(·|x^{[1,i−1]}) the law of x^1 for i = 1, all under the law Q. Define P_i(·|x^{[1,i−1]}) similarly but under P. We shall use the Kullback information between conditional distributions,

\[
H_i(y^{[1,i-1]})=H\bigl(Q_i(\cdot|y^{[1,i-1]})\,\big|\,P_i(\cdot|y^{[1,i-1]})\bigr),
\]

and exploit the following important identity:

\[
H(Q|P)=\sum_{i=1}^N\int_{E^N}H_i(y^{[1,i-1]})\,\mathrm{d}Q(y).
\]

The key is to construct an appropriate coupling of Q and P, that is, two random sequences Y^{[1,N]} and X^{[1,N]} taking values in E^N, distributed according to Q and P, respectively, on some probability space (Ω, F, P). We define a joint distribution L(Y^{[1,N]}, X^{[1,N]}) by induction as follows (the Marton coupling). At first, the law of (Y^1, X^1) is the optimal coupling of Q(x^1 ∈ ·) and P(x^1 ∈ ·) (= µ_1(·|x_0)). Assume that for some i, 2 ≤ i ≤ N, (Y^{[1,i−1]}, X^{[1,i−1]}) = (y^{[1,i−1]}, x^{[1,i−1]}) is given. Then the joint conditional distribution L(Y^i, X^i | Y^{[1,i−1]} = y^{[1,i−1]}, X^{[1,i−1]} = x^{[1,i−1]}) is the optimal coupling of Q_i(·|y^{[1,i−1]}) and P_i(·|x^{[1,i−1]}), that is,

\[
\mathbb E\bigl(d(Y^i,X^i)\,|\,Y^{[1,i-1]}=y^{[1,i-1]},X^{[1,i-1]}=x^{[1,i-1]}\bigr)=W_{1,d}\bigl(Q_i(\cdot|y^{[1,i-1]}),P_i(\cdot|x^{[1,i-1]})\bigr).
\]

Obviously, Y^{[1,N]}, X^{[1,N]} are of law Q, P, respectively. By the triangle inequality for the W_{1,d} distance,

\[
\begin{aligned}
\mathbb E\bigl(d(Y^i,X^i)\,|\,Y^{[1,i-1]}=y^{[1,i-1]},X^{[1,i-1]}=x^{[1,i-1]}\bigr)
&\le W_{1,d}\bigl(Q_i(\cdot|y^{[1,i-1]}),P_i(\cdot|y^{[1,i-1]})\bigr)+W_{1,d}\bigl(P_i(\cdot|y^{[1,i-1]}),P_i(\cdot|x^{[1,i-1]})\bigr)\\
&\le\sqrt{2C_1H_i(y^{[1,i-1]})}+\sum_{j=1}^{i-1}c_{ij}\,d(x^j,y^j).
\end{aligned}
\]

By recurrence on i, this entails that E d(Y^i, X^i) < ∞ for all i = 1,...,N. Taking the average with respect to L(Y^{[1,i−1]}, X^{[1,i−1]}), summing over i and using Jensen's inequality, we have

\[
\frac{\sum_{i=1}^N\mathbb E\,d(Y^i,X^i)}{N}\le\sqrt{\frac{2C_1\sum_{i=1}^N\mathbb E\,H_i(Y^{[1,i-1]})}{N}}+\frac{\sum_{i=1}^N\sum_{j=1}^{i-1}c_{ij}\,\mathbb E\,d(Y^j,X^j)}{N}
\]

\[
\begin{aligned}
&=\sqrt{\frac{2C_1H(Q|P)}{N}}+\frac{\sum_{j=1}^{N-1}\mathbb E\,d(Y^j,X^j)\sum_{i=j+1}^N c_{ij}}{N}\\
&\le\sqrt{\frac{2C_1H(Q|P)}{N}}+\frac{r_1\sum_{j=1}^{N}\mathbb E\,d(Y^j,X^j)}{N};
\end{aligned}
\]

the above inequality gives us

\[
W_{1,d_{L^1}}(Q,P)\le\sum_{i=1}^N\mathbb E\,d(Y^i,X^i)\le\sqrt{\frac{2NC_1}{(1-r_1)^2}H(Q|P)},
\]

that is, P = P(x_0, ·) ∈ T_1(NC_1/(1 − r_1)²). □

Theorem 2.7 is based on the following dependent tensorization result of Djellout, Guillin and Wu [4].

Lemma 3.5 ([4], Theorem 2.11). Let P be a probability measure on the product space (E^n, B^n), n ≥ 2. For any x = (x_1,...,x_n) ∈ E^n, write x_{[1,k]} := (x_1,...,x_k). Let P_k(·|x_{[1,k−1]}) denote the regular conditional law of x_k given x_{[1,k−1]} under P for 2 ≤ k ≤ n, and P_k(·|x_{[1,k−1]}) be the distribution of x_1 for k = 1. Assume that:

(1) For some metric d on E, P_k(·|x_{[1,k−1]}) ∈ T_1(C) on (E, d) for all k ≥ 1, x_{[1,k−1]} ∈ E^{k−1};
(2) there is some constant S > 0 such that for all real bounded Lipschitzian functions f(x_{k+1},...,x_n) with ‖f‖_{Lip(d_{L^1})} ≤ 1 and all x ∈ E^n, y_k ∈ E,

\[
|\mathbb E_P(f(X_{k+1},\ldots,X_n)\,|\,X_{[1,k]}=x_{[1,k]})-\mathbb E_P(f(X_{k+1},\ldots,X_n)\,|\,X_{[1,k]}=(x_{[1,k-1]},y_k))|\le S\,d(x_k,y_k).
\]

Then for every function F on E^n satisfying ‖F‖_{Lip(d_{L^1})} ≤ α, we have

\[
\mathbb E_P\,e^{\lambda(F-\mathbb E_PF)}\le\exp\bigl(C\lambda^2(1+S)^2\alpha^2n/2\bigr)\qquad\forall\lambda\in\mathbb R.
\]

Equivalently, P ∈ T_1(C_n) on (E^n, d_{L^1}) with

\[
C_n=nC(1+S)^2.
\]

We are now ready to prove Theorem 2.7.

Proof of Theorem 2.7. We will apply Lemma 3.5 with (E, d) being (E^N, d_{L^1}) and with P being the law of (Z_1,...,Z_n) on (E^N)^n.

By (3.3), Lemma 3.2 and the condition that r1 < 1/2, the constant S in Lemma 3.5 is bounded from above by

\[
\sup_{x,y\in E^N}\frac1{d_{L^1}(x,y)}\sum_{k=1}^\infty\mathbb E^{x,y}d_{L^1}(X_{kN},Y_{kN})\le\sum_{k=1}^\infty\Bigl(\frac{r_1}{1-r_1}\Bigr)^k=\frac{r_1}{1-2r_1}.
\]

Take S = r_1/(1 − 2r_1) and F(Z_1,...,Z_n) = (1/n) Σ_{k=1}^n f(Z_k); then the Lipschitzian norm of F w.r.t. the metric d_{L^1}(x, y) = Σ_{k=1}^n d_{L^1}(x_k, y_k) (for x, y ∈ (E^N)^n) is not greater than ‖f‖_{Lip(d_{L^1})}/n ≤ α/n. Thus by Lemmas 3.5 and 3.4,

\[
\mathbb E_x\exp\Biggl(\lambda\Biggl(\frac1n\sum_{k=1}^nf(Z_k)-\frac1n\sum_{k=1}^nP^kf(x)\Biggr)\Biggr)\le\exp\Biggl(\frac{NC_1}{(1-r_1)^2}\lambda^2(1+S)^2\Bigl(\frac{\alpha}{n}\Bigr)^2n/2\Biggr)\qquad\forall\lambda\in\mathbb R.
\]

So, by the classic approach, first using Chebyshev's inequality and then optimizing over λ ≥ 0, we obtain the desired part (a) of Theorem 2.7. Furthermore, by Theorem 2.3, we have

\[
\begin{aligned}
\frac1n\sum_{k=1}^nP^kf(x)-\mu(f)&\le\frac1n\sum_{k=1}^n|P^kf(x)-\mu(f)|\\
&\le\frac1n\sum_{k=1}^nr^k\max_{1\le i\le N}\int_{E^N}d(x^i,y^i)\,\mathrm{d}\mu(y)\cdot\sum_{i=1}^N\delta_i(f)\\
&\le\frac1n\cdot\frac{r}{1-r}\max_{1\le i\le N}\int_{E^N}d(x^i,y^i)\,\mathrm{d}\mu(y)\cdot\sum_{i=1}^N\delta_i(f).
\end{aligned}
\]

Thus, we obtain part (b) of Theorem 2.7 from part (a). □

Acknowledgements

Supported in part by the Thousand Talents Program of the Chinese Academy of Sciences and by le projet ANR EVOL. We are grateful to the two referees for their suggestions and references, which significantly improved the presentation of the paper.

References

[1] Bobkov, S.G. and Götze, F. (1999). Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. J. Funct. Anal. 163 1–28. MR1682772

[2] Diaconis, P., Khare, K. and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials. Statist. Sci. 23 151–178. MR2446500
[3] Diaconis, P., Khare, K. and Saloff-Coste, L. (2010). Gibbs sampling, conjugate priors and coupling. Sankhya A 72 136–169. MR2658168
[4] Djellout, H., Guillin, A. and Wu, L. (2004). Transportation cost-information inequalities and applications to random dynamical systems and diffusions. Ann. Probab. 32 2702–2732. MR2078555
[5] Dobrushin, R.L. (1968). The description of a random field by means of conditional probabilities and condition of its regularity. Theory Probab. Appl. 13 197–224.
[6] Dobrushin, R.L. (1970). Prescribing a system of random variables by conditional distributions. Theory Probab. Appl. 15 458–486.
[7] Dobrushin, R.L. and Shlosman, S.B. (1985). Completely analytical Gibbs fields. In Statistical Physics and Dynamical Systems (Kőszeg, 1984). Progress in Probability 10 371–403. Boston, MA: Birkhäuser. MR0821307
[8] Doucet, A., de Freitas, N. and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer. MR1847783
[9] Gozlan, N. and Léonard, C. (2007). A large deviation approach to some transportation cost inequalities. Probab. Theory Related Fields 139 235–283. MR2322697
[10] Gozlan, N. and Léonard, C. (2010). Transport inequalities. A survey. Markov Process. Related Fields 16 635–736. MR2895086
[11] Hairer, M. and Mattingly, J.C. (2011). Yet another look at Harris' ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI. Progress in Probability 63 109–117. Basel: Birkhäuser. MR2857021
[12] Joulin, A. and Ollivier, Y. (2010). Curvature, concentration and error estimates for Markov chain Monte Carlo. Ann. Probab. 38 2418–2442. MR2683634
[13] Ledoux, M. (1999). Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités, XXXIII. Lecture Notes in Math. 1709 120–216. Berlin: Springer. MR1767995
[14] Ledoux, M. (2001). The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs 89. Providence, RI: Amer. Math. Soc. MR1849347
[15] Martinelli, F. (1999). Lectures on Glauber dynamics for discrete spin models. In Lectures on Probability Theory and Statistics (Saint-Flour, 1997). Lecture Notes in Math. 1717 93–191. Berlin: Springer. MR1746301
[16] Martinelli, F. and Olivieri, E. (1994). Approach to equilibrium of Glauber dynamics in the one phase region. I. The attractive case. Comm. Math. Phys. 161 447–486. MR1269387
[17] Marton, K. (1996). Bounding d-distance by informational divergence: A method to prove measure concentration. Ann. Probab. 24 857–866. MR1404531
[18] Marton, K. (1996). A measure concentration inequality for contracting Markov chains. Geom. Funct. Anal. 6 556–571. MR1392329
[19] Marton, K. (2003). Measure concentration and strong mixing. Studia Sci. Math. Hungar. 40 95–113. MR2002993
[20] Marton, K. (2004). Measure concentration for Euclidean distance in the case of dependent random variables. Ann. Probab. 32 2526–2544. MR2078549
[21] Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. Communications and Control Engineering Series. London: Springer. MR1287609

[22] Paulin, D. (2012). Concentration inequalities for Markov chains by Marton coupling. Preprint.
[23] Rio, E. (2000). Inégalités de Hoeffding pour les fonctions lipschitziennes de suites dépendantes. C. R. Acad. Sci. Paris Sér. I Math. 330 905–908. MR1771956
[24] Roberts, G.O. and Rosenthal, J.S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surv. 1 20–71. MR2095565
[25] Rosenthal, J.S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 90 558–566. MR1340509
[26] Rosenthal, J.S. (1996). Analysis of the Gibbs sampler for a model related to James–Stein estimations. Statist. Comput. 6 269–275.
[27] Rosenthal, J.S. (2002). Quantitative convergence rates of Markov chains: A simple account. Electron. Commun. Probab. 7 123–128 (electronic). MR1917546
[28] Samson, P.M. (2000). Concentration of measure inequalities for Markov chains and Φ-mixing processes. Ann. Probab. 28 416–461. MR1756011
[29] Talagrand, M. (1996). Transportation cost for Gaussian and other product measures. Geom. Funct. Anal. 6 587–600. MR1392331
[30] Villani, C. (2003). Topics in Optimal Transportation. Graduate Studies in Mathematics 58. Providence, RI: Amer. Math. Soc. MR1964483
[31] Winkler, G. (1995). Image Analysis, Random Fields and Dynamic Monte Carlo Methods: A Mathematical Introduction. Applications of Mathematics (New York) 27. Berlin: Springer. MR1316400
[32] Wintenberger, O. (2012). Weak transport inequalities and applications to exponential inequalities and oracle inequalities. Preprint.
[33] Wu, L. (2006). Poincaré and transportation inequalities for Gibbs measures under the Dobrushin uniqueness condition. Ann. Probab. 34 1960–1989. MR2271488
[34] Zegarliński, B. (1992). Dobrushin uniqueness theorem and logarithmic Sobolev inequalities. J. Funct. Anal. 105 77–111. MR1156671

Received February 2013 and revised April 2013