
Effective limit theorems for Markov chains with a spectral gap — Benoît Kloeckner

To cite this version:

Benoît Kloeckner. Effective limit theorems for Markov chains with a spectral gap. 2017. hal-01497377v2

HAL Id: hal-01497377 https://hal.archives-ouvertes.fr/hal-01497377v2 Preprint submitted on 30 Nov 2017 (v2), last revised 2 Mar 2019 (v4)

Effective limit theorems for Markov chains with a spectral gap

Benoît R. Kloeckner * November 30, 2017

Applying quantitative perturbation theory for linear operators, we prove non-asymptotic limit theorems for Markov chains whose transition kernel has a spectral gap in an arbitrary Banach algebra of functions X. The main results are concentration inequalities and Berry-Esseen bounds, obtained assuming neither reversibility nor a "warm start" hypothesis: the law of the first term of the chain can be arbitrary. The spectral gap hypothesis is basically a uniform X-ergodicity hypothesis, and when X consists of regular functions this is weaker than uniform ergodicity. We show on a few examples how the flexibility in the choice of function space can be used. The constants are completely explicit and reasonable enough to make the results usable in practice, notably in MCMC methods.

1 Introduction

General framework. Let (X_k)_{k≥0} be a Markov chain taking values in a general state space Ω, and let φ : Ω → R be a function (the "observable"). Under rather general assumptions, there is a unique stationary measure μ0, and it can be proved that almost surely¹

(1/n) Σ_{k=1}^{n} φ(X_k) → μ0(φ)    (1)

Then a natural question is to ask at what speed this convergence occurs. In many cases, one can prove a Central Limit Theorem, showing that the convergence has the order

*Université Paris-Est, Laboratoire d'Analyse et de Mathématiques Appliquées (UMR 8050), UPEM, UPEC, CNRS, F-94010, Créteil, France
¹Here and in the sequel, we write indifferently μ(f) or ∫ f dμ for the integral of f with respect to the measure μ.

1/√n. But this is again an asymptotic result, and one is led to ask for non-asymptotic bounds, both for the Law of Large Numbers (1) ("concentration inequalities") and for the CLT ("Berry-Esseen bounds").

A word on effectivity. In this paper, the emphasis will be on effective bounds, i.e. given an explicit sample size n, one should be able to deduce from the bound that the quantity being considered lies in some explicit interval around its limit with at least some explicit probability. In other words, the result should be non-asymptotic and all constants should be made explicit. The motivations for this are at least twofold. First, in practical applications of the Markov chain Monte-Carlo (MCMC) method, where one uses (1) to estimate the integral μ0(φ), effective results are needed to obtain proven convergence of a given precision. MCMC methods are important when the measure of interest is either unknown, or difficult to sample independently (e.g. uniform in a convex set in large dimension), but happens to be the stationary measure for an easily simulated Markov chain. The Metropolis-Hastings algorithm for example makes it possible to deal with an absolutely continuous measure whose density is only known up to the normalization constant. A second, more theoretical motivation is that the constants appearing in limit theorems depend on a number of parameters (e.g. the mixing speed of the Markov chain, the law of X_0, etc.). When the constants are not made explicit, one may not be able to deduce from the result how the convergence speed changes when some parameter approaches the limit of the domain where the result is valid (e.g. when the spectral gap tends to 0). There are very many works proving concentration inequalities and (to a lesser extent) Berry-Esseen bounds for Markov chains, under a variety of assumptions, and we will only mention a small number of them. To explain the purpose of this article, let us discuss briefly three directions.

Previous works (1): total variation convergence. The first direction is mainly motivated by MCMC; we refer to [RR+04] for a detailed introduction to the topic. The Markov chains being considered are usually ergodic (either uniformly, which in the setting of this paper corresponds to a spectral gap on L∞, or geometrically); one measures the difference between probability measures using the total variation distance, and the limit theorems are typically obtained for L∞ observables φ (the emphasis here is not on the boundedness, but on the lack of regularity assumption). Effective concentration inequalities have been obtained in this setting, for example in [GO02] and [KLMM05], which we shall discuss below. Berry-Esseen bounds have been proved in [Bol82], but effective results are less common.

Previous works (2): the spectral method. The second direction grew from the "Nagaev method" [Nag57, Nag61], a functional approach where perturbation theory for operators enables one to adapt the classical Fourier proofs of limit theorems from independent identically distributed random variables to suitable Markov chains. This approach is described in [HH01] in a quite general setting, and is especially popular in dynamical

systems (the statistical properties of certain dynamical systems can be studied more easily by reversing time, and considering a Markov chain jumping randomly along backward orbits). There, the Markov chains being considered are often not ergodic in the total variation sense, but instead their transition kernel has a spectral gap in a space X made of regular functions (one sometimes says such a Markov chain is X-ergodic). The limit theorems are then restricted to observables φ ∈ X, and the speed of convergence is driven by the regularity of φ as much as by its magnitude. Due to the use of perturbation theory of operators, in most cases this method has not yielded effective results. Note that the spectral method can be applied without regularity assumptions, taking e.g. X = L²(μ0) or X = L∞(Ω) (or variants, see [KM12]), thus the present direction intersects the previous one. There are (at least) two exceptions to the aforementioned lack of effectiveness. First, when X is a Hilbert space, by symmetrization of the transition kernel one can use well-known effective perturbation results. In this way, Lezaud obtains effective concentration inequalities and Berry-Esseen bounds [Lez98, Lez01], see also [Pau15]. Both work in L²(μ0), restricting accordingly the Markov chains that can be considered. Second, Dubois [Dub11] gave what seems to be the first effective Berry-Esseen inequality in a dynamical context, and we shall compare the present Berry-Esseen inequality with his.

Previous works (3): optimal transportation. The third direction is quite recent: Joulin and Ollivier [JO10] used ideas from optimal transportation to prove very efficiently effective concentration results under a positive curvature hypothesis; this corresponds to a spectral gap with constant 1 on the space X = Lip of Lipschitz functions. Paulin [Pau16] extended this method to a spectral gap "with arbitrary constant" in the terminology set up below. This method is very appealing, but is restricted to a single, pretty restrictive function space, constraining both the Markov chains and the observables that can be considered; we will see in examples below that being able to change the function space can be useful to get good constants even when [JO10] can be applied. Moreover, this method seems unable to provide higher-order limit theorems such as the CLT or Berry-Esseen bounds.

Contributions of this work. The goal of this article is to combine recent effective perturbation results [Klo17b] with the Nagaev method to obtain effective concentration inequalities and Berry-Esseen bounds for a wealth of Markov chains. Our main hypothesis will be a spectral gap on some function space X, with the sole restriction that we need X to be a Banach algebra (this will in particular restrict us to bounded observables). We obtain three main results:

∙ a general concentration inequality (Theorem A),

∙ a variant which, under a bound on the dynamical variance of (φ(X_k))_{k≥0}, gives an optimal rate for small enough deviations (Theorem B),

∙ a general Berry-Esseen bound (Theorem C).

Let us give a few examples where our results apply:

∙ taking X = 퐿∞(Ω), our assumptions essentially reduce to uniform ergodicity of the Markov chain and boundedness of the observable, a very classical case. We obtain a convergence rate proportional to the spectral gap, improving on [GO02, KLMM05] where the rate is proportional to the square of the spectral gap (see Section 3.2 and especially Theorem 3.3),

∙ taking X = Lip(Ω), our assumptions essentially reduce to positively curved Markov chains (in the sense of Ollivier) and bounded Lipschitz observables. This for example applies to contracting Iterated Function Systems and backward random walks of expanding maps. We shall see (Section 3.3) that in the toy case of the discrete hypercube and observables with small Lipschitz constant, Theorem A is less powerful than [JO10], but that for larger Lipschitz constants, Theorem B can improve on [JO10],

∙ when Ω is a graph, we propose a functional space of functions with small “local total variations” and show on an example that it can improve on [JO10] (also in Section 3.3),

∙ taking X = BV(I) where I is an interval, we show that our results apply to natural Markov chains related to Bernoulli convolutions, and allowing observables of bounded variation makes our result applicable to e.g. characteristic functions of intervals,

∙ more generally, when Ω is a domain of R^d, some natural Markov chains are BV(Ω)-ergodic and our results apply to functions of bounded variation, e.g. characteristic functions of sets of finite perimeter – but we will not consider this case here, since it needs a somewhat sophisticated setup,

∙ Another direction we do not explore here is to take X = Hol_α(Ω), the space of α-Hölder functions, or in case Ω = I is an interval, X = BV_p(I), the space of p-bounded variation functions. These enable one to consider more general functions than Lip(Ω) or respectively BV(Ω); even for Lipschitz or BV functions, using these spaces can be useful because they tend to give regular observables a much lower norm.

To my knowledge, no effective result was known in the setting of bounded variation functions (and while the usual spectral method could have been used in this case, I do not know of previous asymptotic results either); no effective result could give an optimal rate for moderate deviations as Theorem B does; and the effective Berry-Esseen bound seems new in most of the above cases.

Possible follow-ups. While we shall take some time discussing examples, we are far from having exhausted the domain of applicability of our results. Finding other cases to which they apply is a natural direction to pursue.

One limitation of this work is that we ask for a spectral gap with constant 1, i.e. the averaging operator defined by the transition kernel should be contracting on the kernel of the stationary measure. Extending to spectral gaps with arbitrary constants (i.e. eventual exponential contractivity) is possible in principle with the methods used here, but would be very technical. It could improve the rate of convergence in many cases, by replacing δ0 (defined so that 1 − δ0 is the contraction constant) by the real spectral gap δ (while δ0 is only a lower bound on the spectral gap). In practical applications, this is not crucial since one can always extract a sub-Markov chain (X_{ℓk})_{k≥0}: for ℓ large enough one gets δ0 close to δ (see Remark 2.6). It appears from the numerical comparison with previous results that our improvements are in several cases asymptotically strong but numerically modest (the improvements are large only in relatively extreme ranges of the parameters), and this seems in part due to the restriction to Banach algebras, and the ensuing necessity to combine ‖·‖∞ with a semi-norm (see Section 3). It would be interesting to dispense with this necessity, for example by making effective the Keller-Liverani theorem [KL99].

Structure of the article. In Section 2 we state notation and the main results. Section 3 explains in detail the aforementioned examples and comparisons with previous results. In Section 4 we recall how perturbation theory can be used to prove limit theorems, and state the perturbation results we need to carry out this method in an effective manner. In Section 5 we prove the core estimates to be used thereafter, while Section 6 carries out the proof of the concentration inequalities. Section 7 is devoted to the proof of the Berry-Esseen inequality.

2 Assumptions and main results

Let Ω be a Polish metric space endowed with its Borel σ-algebra, and denote by P(Ω) the set of probability measures on Ω. We consider a transition kernel M = (m_x)_{x∈Ω} on Ω, i.e. m_x ∈ P(Ω) for each x ∈ Ω, and a Markov chain (X_k)_{k≥0} following the kernel M, i.e. P(X_{k+1} ∈ · | X_k = x) = m_x(·). We do not ask the Markov chain to be stationary: the law of X_0 is arbitrary ("cold start"); in some cases of interest, the law of each X_k will even be singular with respect to the stationary measure μ0. We shall study the behavior of (X_k)_{k≥0} by comparing the empirical mean to the stationary mean:

μ̂_n(φ) := (1/n) Σ_{k=1}^{n} φ(X_k)    vs.    μ0(φ)

for an arbitrary "observable" φ ∈ X, where X is a space of functions Ω → R (or Ω → C).

2.1 Assumptions

Standing assumption 2.1. Throughout the paper, we assume X satisfies the following:

5 i. its norm ‖·‖ dominates the uniform norm: ‖·‖ ≥ ‖·‖∞,

ii. X is a Banach algebra, i.e. for all 푓, 푔 ∈ X we have ‖푓푔‖ ≤ ‖푓‖‖푔‖,

iii. X contains the constant functions and ‖1‖ = 1 (where 1 denotes the constant function with value 1).

The first hypothesis ensures integrability with respect to arbitrary probability measures, which is important for cold-start Markov chains; it also implies that every probability measure can be seen as a continuous linear form acting on X. The second hypothesis will prove very important in our method, where products abound (and it can be replaced by the more lenient ‖fg‖ ≤ C‖f‖‖g‖ up to multiplying the norm by a constant); the hypothesis on ‖1‖ is a mere matter of convenience and could be removed at the cost of more complicated formulas.

Remark 2.2. This setting may seem restrictive at first: the Banach algebra hypothesis notably excludes L^p spaces, while classically one only makes moment assumptions on the observable. This is quite unavoidable given that we will work with more than one equivalence class of measures, and we want to allow a cold start at a given position (X_0 ∼ δ_{x0}). The measures m_x may be singular with respect to the stationary measure μ0, and as a matter of fact in the dynamical applications m_x will be purely atomic while μ0 will often be atomless. It may thus happen that for φ an L^p(μ0) observable, φ(X_j) is undefined with positive probability, or is extremely large even if φ has small moments with respect to μ0. Our framework ensures enough regularity to prevent such phenomena.

To the transition kernel M is associated an averaging operator acting on X:

L0 f(x) = ∫_Ω f(y) dm_x(y).

Since each m_x is a probability measure, L0 has 1 as an eigenvalue, with eigenfunction 1.

Standing assumption 2.3. Throughout the article we assume M satisfies the following:

i. L0 acts as a bounded operator from X to itself, and its operator norm ‖L0‖ is equal to 1.

ii. L0 has a spectral gap with constant 1 and size δ0 > 0, i.e. there is a hyperplane G0 ⊂ X such that

‖L0 f‖ ≤ (1 − δ0)‖f‖  for all f ∈ G0.

The first hypothesis could be relaxed, considering operators of arbitrary norm, at the cost of (much) more complicated formulas. The second hypothesis is the main one, and implies in particular that 1 is a simple isolated eigenvalue.

Remark 2.4. This second hypothesis ensures that up to scalar factors there is a unique continuous linear form φ0 acting on X such that φ0 ∘ L0 = φ0; since any stationary measure of M satisfies this, all stationary measures coincide on X. They might not be unique (e.g. if X contains only constants), but since we consider the φ(X_k) with φ ∈ X, this will not matter. We will thus denote an arbitrary stationary measure by μ0, and identify it with φ0 (observe that G0 is then equal to ker μ0). In most cases, X will be dense in the space of continuous functions endowed with the uniform norm, ensuring that two measures coinciding on X are equal, and then the spectral gap hypothesis ensures the uniqueness of the stationary measure.

Remark 2.5. There are numerous examples where Assumptions 2.1 and 2.3 are satisfied; we will discuss a few of them in Section 3. Typically, X has a norm of the form ‖·‖ = ‖·‖∞ + V(·) where V is a seminorm measuring the regularity in some sense (e.g. Lipschitz constant, α-Hölder constant, total variation, total p-variation...) and satisfying

V(fg) ≤ ‖f‖∞ V(g) + V(f) ‖g‖∞.

This inequality ensures that X is a Banach algebra: indeed ‖fg‖ = ‖fg‖∞ + V(fg) ≤ ‖f‖∞‖g‖∞ + ‖f‖∞V(g) + V(f)‖g‖∞ ≤ ‖f‖‖g‖. Moreover ‖1‖ = 1 holds as soon as V(1) = 0. Since averaging operators necessarily satisfy ‖L0 f‖∞ ≤ ‖f‖∞, it is sufficient that L0 contracts V (i.e. V(L0 f) ≤ θV(f) for some θ ∈ (0, 1) and all f ∈ X) to ensure that ‖L0‖ = 1. We will prove in Lemma 3.1 that in many cases, the contraction also implies a spectral gap of explicit size and constant 1. In fact, all examples considered here are of this kind, but it seemed better to state our main results in terms of the hypotheses we use directly in the proof. This is done at the expense of some sharpness: indeed we could improve our constants under the hypotheses of Lemma 3.1, by estimating ‖π0‖ below with more precision. The method is similar to the proof of Lemma 3.1, and is carried out in two examples in [Klo17a].

Remark 2.6.
In some cases, one gets a spectral gap with a constant greater than 1, i.e.

‖L_0^n f‖ ≤ C(1 − δ0)^n ‖f‖  for all f ∈ G0,

for all n ∈ N and some C > 1. In this case, all our results apply to the Markov chains Y_m = X_{n0+mk}, where n0 is arbitrary and k is such that C(1 − δ0)^k < 1. This can also be used when C = 1, in cases where the spectral gap is small. In numerical computations, this can be especially useful when the simulation of the random walk is much cheaper than the evaluation of the observable.

2.2 A general concentration inequality

Our first result is a concentration inequality, featuring the expected dichotomy between a Gaussian regime and an exponential regime.

Theorem A. For all n ≥ 60/δ0 and all a > 0, it holds

P_μ[ |μ̂_n(φ) − μ0(φ)| ≥ a ] ≤ 2.5 exp( − (n a² / ‖φ‖²) · δ0 / (13.44 δ0 + 8.324) )    if a/‖φ‖ ≤ δ0/3,

P_μ[ |μ̂_n(φ) − μ0(φ)| ≥ a ] ≤ 2.7 exp( − (n a / ‖φ‖) · 0.009 δ0² )    otherwise.

Remark 2.7. In Theorem A we made a compromise between precision and simplicity; we in fact obtain slightly better front constants, a slightly relaxed range for n, and a more precise rate in the exponential regime (see Theorems 6.3 and 6.4). See Section 3 for several applications and comparisons with previous results.

Let us stress right away that the main strength of the present result is its broadness: we need no warm-start hypothesis, no reversibility, and we can apply it in many functional spaces. In particular, this makes our results broader than those of [Lez98, Lez01], which assume ergodicity. Lezaud also gets a front constant proportional to the L²(μ0)-norm of the density of the distribution of X_0 with respect to the stationary distribution, which would be infinite in many of our cases of applicability; even in the case of a finite state space he then gets a large front constant when X_0 ∼ δ_x. The approach of Joulin and Ollivier enabled them to get rid of this constant in some test cases, and we compare our results to theirs in Section 3.3. In some cases, Theorem A can improve on previous results by allowing one to choose the functional space most suited to the situation at hand; an example is treated in detail in Section 3.3.3.

While our constants are certainly not optimal, we get what seems to be the correct dependence on the spectral gap, at least in the Gaussian regime (rate proportional to δ0); in the case of Markov chains satisfying the Doeblin minorization condition, this improves on the rate proportional to δ0² in [KLMM05] (see Section 3.2).

2.3 Concentration under a variance bound

The spectral method gives us access to higher-order estimates, enabling us to improve the Gaussian regime bound as soon as we have a good control over the "dynamical variance" (also called "asymptotic variance"). This quantity is defined as

σ²(φ) = μ0(φ²) − (μ0(φ))² + 2 Σ_{k≥1} μ0(φ L_0^k φ̄)

where φ̄ = φ − μ0(φ). The dynamical variance is precisely the variance appearing in the CLT for (φ(X_k))_{k≥0}.

Theorem B. If U ∈ [0, +∞) is an upper bound for σ²(φ), then for all a ≤ U δ0² / (26‖φ‖) and all n ≥ 60/δ0 it holds

P_μ[ |μ̂_n(φ) − μ0(φ)| ≥ a ] ≤ 2.7 exp( − n a² / (2U) + 10 (1 + δ0^{-1})² · n a³ ‖φ‖³ / U³ )

For small enough a, the positive term in the exponential is negligible, and the leading term is exactly the best we can expect given the available knowledge: since (φ(X_k))_k satisfies a Central Limit Theorem with variance σ²(φ), any better value would necessarily imply a better bound on σ²(φ).

Remark 2.8. Again we actually prove a slightly more precise result, see Section 6.3.
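To make the definition of the dynamical variance concrete, here is a minimal numerical sketch (our own illustration, not from the paper) for a small finite-state chain: it computes σ²(φ) both by truncating the series above and via the fundamental matrix Z = (I − P + 1π)^{-1}, for which σ²(φ) = π(φ̄ · (2Z − I)φ̄); the transition matrix and observable are arbitrary toy choices.

```python
import numpy as np

# Toy 3-state transition matrix P (rows sum to 1) -- our own example.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
phi = np.array([1.0, -0.5, 2.0])

# Stationary distribution: left Perron eigenvector of P.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

phibar = phi - pi @ phi  # centered observable

# Dynamical variance by truncating the series
# sigma^2 = mu0(phibar^2) + 2 sum_{k>=1} mu0(phibar * L0^k phibar).
sigma2_series = pi @ (phibar * phibar)
Pk = np.eye(3)
for _ in range(200):
    Pk = Pk @ P
    sigma2_series += 2 * pi @ (phibar * (Pk @ phibar))

# Same quantity via the fundamental matrix Z = (I - P + 1 pi)^{-1}.
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), pi))
sigma2_Z = pi @ (phibar * ((2 * Z - np.eye(3)) @ phibar))

assert abs(sigma2_series - sigma2_Z) < 1e-10
```

In practice one would use such an exact computation only on small test chains; on a genuine MCMC problem the dynamical variance must be estimated, as the remark below Theorem B suggests.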

In Section 3.3.3, we show on an example how Theorem B can improve on Theorem A. However, obtaining a good estimate of the dynamical variance can be difficult; in practical applications, one could use other tools to estimate it, and then apply Theorem B.

2.4 A Berry-Esseen bound

Our third main result, proven in Section 7, quantifies the speed of convergence in the Central Limit Theorem.

Theorem C. Assume σ²(φ) > 0, let φ̃ := (φ − μ0(φ))/σ(φ) be the reduced centered version of φ, and denote by G and F_n the distribution functions of the reduced centered normal law and of (1/√n)(φ̃(X_1) + ··· + φ̃(X_n)), respectively. For all n ≥ (60/δ0)² it holds

‖F_n − G‖∞ ≤ 177 (δ0^{-1} + 1.13)² max{‖φ̃‖, ‖φ̃‖³} / √n

Remark 2.9. The hypothesis on n is pretty harmless: if ‖φ̃‖ ≃ ‖φ̃‖³ ≃ 1, then for n ≤ (60/δ0)² the right-hand side is much larger than 1, and the inequality is void. As before, a slightly more precise result can be obtained (see Section 7).

Remark 2.10. Note that σ²(φ) is always non-negative, as it can be rewritten as

lim_{n→∞} (1/n) Var_{μ0}( Σ_{k=1}^{n} φ(X_k) )

(where the μ0 subscript means that the assumption X_0 ∼ μ0 is made). However, σ²(φ) can vanish even when φ is not constant modulo μ0, as is the case in a dynamical setting when m_x is supported on T^{-1}(x) for some map T : Ω → Ω, and φ is a coboundary: φ = g − g ∘ T for some g. One can for example see [GKLMF15], where σ² is interpreted as a semi-norm. Whenever σ²(φ) = 0, one can use the present method to obtain stronger non-asymptotic concentration inequalities, giving small probability to deviations a such that a/‖φ‖ ≫ 1/n^{2/3} instead of a/‖φ‖ ≫ 1/√n.

There are numerous works on Berry-Esseen bounds. In the case of independent identically distributed random variables, the optimal constant is not yet known (the best known constant is, to my knowledge, given by Tyurin [Tyu11]). Berry-Esseen bounds for Markov chains go back to [Bol82], but I know only of two previous effective results, by Dubois [Dub11] and by Lezaud [Lez01]. The scope of Dubois' result is quite narrower than ours, as it is only written for uniformly expanding maps of the interval and Lipschitz observables (though the method is expected to have wider application), and our numerical constant is much better: while the dependences on the parameters of the system are stated differently and thus somewhat difficult to compare, Dubois has a front constant of 11460, which is quite large

for practical applications (the order of convergence being 1/√n, this constant has a squared effect on the number of iterations needed to achieve a given precision). The scope of Lezaud's Berry-Esseen bound is also restricted, to ergodic reversible Markov chains. Moreover he gets a front constant proportional to the L²(μ0)-norm of the density of the distribution of X_0 with respect to the stationary distribution; in comparison, our result is insensitive to the distribution of X_0.

Application to dynamical systems

As is well-known, limit theorems for Markov chains also apply in a dynamical setting; let us give some details. Given a k-to-one map T : Ω → Ω, one defines the transfer operator of a potential A ∈ X by

L_{T,A} f(x) = Σ_{y ∈ T^{-1}(x)} e^{A(y)} f(y).

One says that A is normalized when L_{T,A} 1 = 1. This condition exactly means that m_x = Σ_{y ∈ T^{-1}(x)} e^{A(y)} δ_y is a probability measure for all x, making L_{T,A} the averaging operator of a transition kernel. We could consider more general maps T, considering a transition kernel supported on their inverse branches. If the transfer operator has a spectral gap, then the stationary measure μ0 is unique, and readily seen to be T-invariant. We shall denote it by μ_A to stress the dependence on the potential. The corresponding stationary Markov chain (Y_k)_{k∈N} satisfies all results presented above; but for each n, the time-reversed process defined by X_k = Y_{n−k} (where 0 ≤ k ≤ n) satisfies X_{k+1} = T(X_k): all the randomness lies in X_0 = Y_n. Having taken (Y_k) stationary makes the law of Y_n, i.e. of X_0, independent of the choice of n. It follows:

Corollary 2.11. For all normalized A ∈ X such that L_{T,A} has a spectral gap with constant 1 and size δ0, and for all φ ∈ X, Theorems A, B and C hold for the random process (X_k)_{k∈N} defined by X_0 ∼ μ_A and X_{k+1} = T(X_k).

In this context, a spectral gap was proved in many cases under the impetus of Ruelle; see e.g. the books [Bal00, Rue04], the recent works [BT08, CV13, CS09], and references therein. Let me finally mention [Klo17a] (which is based on the same effective perturbation theory as the present paper) and [Klo17c] (which was my initial motivation to consider the spectral method for limit theorems).
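As a concrete instance of this dynamical setting, here is a toy sketch of ours (not from the paper): for the doubling map T(x) = 2x mod 1, the potential A = −log 2 is normalized (each of the two preimages carries weight e^A = 1/2), the kernel m_x jumps uniformly to the two preimages of x, and the stationary measure is Lebesgue on [0, 1], so the empirical mean of the observable φ(x) = x should concentrate near 1/2.

```python
import numpy as np

# Backward random walk of the doubling map T(x) = 2x mod 1:
# from x, jump uniformly to one of the two preimages x/2 and x/2 + 1/2.
# This is the kernel of the normalized potential A = -log 2, since
# sum over preimages of e^A = 2 * (1/2) = 1.  Our own toy example.
rng = np.random.default_rng(0)

n = 100_000
x = rng.random()                       # cold start at a random point
total = 0.0
for _ in range(n):
    x = x / 2 + rng.integers(2) / 2    # pick a preimage at random
    total += x                          # observable phi(x) = x

est = total / n
# The stationary measure is Lebesgue on [0,1], so mu_0(phi) = 1/2.
assert abs(est - 0.5) < 0.02
```

The chain contracts distances by 1/2 at each step, so it mixes very quickly; the tolerance above is far looser than the fluctuations one actually observes.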

3 Examples

3.1 Preliminary lemma

In each example below we will use the following lemma which, in the spirit of Doeblin-Fortet and Lasota-Yorke inequalities, enables one to turn an exponential contraction in the "regularity part" of a functional norm into a spectral gap.

Lemma 3.1. Consider a normed space X of (Borel measurable, bounded) functions Ω → R, with norm ‖·‖ = ‖·‖∞ + V(·) where V is a semi-norm (usually quantifying some regularity of the argument, such as Lip or BV).

Assume that for some constant C > 0, for all probability measures μ on Ω and for all f ∈ X such that μ(f) = 0, ‖f‖∞ ≤ C V(f). Let L0 ∈ B(X) and assume that for some θ ∈ (0, 1) and all f ∈ X:

‖L0푓‖∞ ≤ ‖푓‖∞ and 푉 (L0푓) ≤ 휃푉 (푓)

and that L0 has eigenvalue 1 with an eigenprobability μ0, i.e. L_0^* μ0 = μ0. Then L0 has a spectral gap (for the eigenvalue 1, the contraction being on the stable space ker μ0) with constant 1, of size

δ0 = (1 − θ) / (1 + Cθ).

The condition ‖푓‖∞ ≤ 퐶푉 (푓) is often valid in practice (assuming Ω has finite diameter for spaces such as Lip(Ω)): the condition that 휇(푓) = 0 implies that 푓 vanishes (if functions in X are continuous) or at least takes both non-positive and non-negative values, and 푉 (푓) usually bounds the variations of 푓, implying a bound on its uniform norm.

Proof. Let f ∈ ker μ0; then ‖L0 f‖∞ ≤ ‖f‖∞ and L0 f ∈ ker μ0, so that ‖L0 f‖∞ ≤ C V(L0 f) ≤ Cθ V(f). Denote by t ∈ [0, 1] the number such that ‖f‖∞ = t‖f‖ (and therefore V(f) = (1 − t)‖f‖). The above two controls on ‖L0 f‖∞ can then be written as ‖L0 f‖∞ ≤ min(t, Cθ(1 − t)) ‖f‖, and using V(L0 f) ≤ θ V(f) again we get

‖L0 f‖ ≤ min( t + θ(1 − t), (C + 1)θ(1 − t) ) ‖f‖,

‖(L0)|_{ker μ0}‖ ≤ max_{t∈[0,1]} min( t + θ(1 − t), (C + 1)θ(1 − t) ).

The maximum is reached when 푡+휃(1−푡) = (퐶 +1)휃(1−푡), i.e. when 푡 = 퐶휃/(1+퐶휃), at which point the value in the minimum is (퐶 + 1)휃/(퐶휃 + 1) ∈ (0, 1). Therefore there is a spectral gap with constant 1 and size 1 − (퐶 + 1)휃/(퐶휃 + 1), as claimed.
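The optimization in this proof is easy to check numerically; the following sketch (ours, not from the paper) scans t ∈ [0, 1] on a fine grid and compares the resulting contraction factor with the closed form δ0 = (1 − θ)/(1 + Cθ) of Lemma 3.1.

```python
import numpy as np

# Numerical check of the optimization in the proof of Lemma 3.1:
# max over t in [0,1] of min(t + theta*(1-t), (C+1)*theta*(1-t))
# equals (C+1)*theta/(1 + C*theta), giving the gap
# delta_0 = (1 - theta)/(1 + C*theta).
def gap_from_scan(C, theta, steps=10**6):
    t = np.linspace(0.0, 1.0, steps)
    bound = np.minimum(t + theta * (1 - t), (C + 1) * theta * (1 - t))
    return 1 - bound.max()

for C, theta in [(1.0, 0.5), (1.0, 0.997), (3.0, 0.2)]:
    closed_form = (1 - theta) / (1 + C * theta)
    assert abs(gap_from_scan(C, theta) - closed_form) < 1e-4
```

The case (C, θ) = (1, 0.997) corresponds to the Doeblin situation of Section 3.2 with β = 0.003, where the formula gives δ0 = β/(2 − β) ≈ 0.0015.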

3.2 Chains with Doeblin's minorization

The simplest example of a Banach algebra of functions is L∞(Ω), the set of measurable bounded functions.² To fit our framework, we will need to endow L∞(Ω) with the following norm:

‖f‖_S = ‖f‖∞ + sup_{x,y∈Ω} |f(x) − f(y)|

Of course, this norm is equivalent to the uniform norm, and it is easily checked that we still get a Banach algebra. The point of this is that the semi-norm

S(f) = sup_{x,y∈Ω} |f(x) − f(y)| = sup f − inf f

2We do not have a single reference measure here, which is why we consider genuinely bounded functions rather than essentially bounded functions.

measures how "spread out" f is, which we need to manage separately from the magnitude of f. Observe that convergence of measures in duality to L∞(Ω) is convergence in total variation, and the most usual normalization is

d_TV(μ, ν) := sup_{S(f)=1} |μ(f) − ν(f)|

For a transition kernel M, having an averaging operator L0 with a spectral gap is a very strong condition, called uniform ergodicity. Glynn and Ormoneit [GO02] and Kontoyiannis, Lastras-Montaño and Meyn [KLMM05] gave explicit concentration results for such chains, using the characterization of uniform ergodicity by the Doeblin minorization condition: there exist an integer ℓ ≥ 1, a positive number β and a probability measure ω on Ω such that for all x ∈ Ω and all Borel sets B ⊂ Ω:

m_x^ℓ(B) ≥ β ω(B)    (2)

where m_x^ℓ is the law of X_ℓ conditionally on X_0 = x. We shall look at the case ℓ = 1, which fits better in our context. For arbitrary values of ℓ, one can in practice apply the result to each extracted chain (X_{k0+kℓ})_{k≥0}.

Lemma 3.2. If M satisfies Doeblin's minorization condition (2) with ℓ = 1, then its averaging operator L0 has a spectral gap on L∞(Ω) with constant 1 and size β/(2 − β).

Proof. This is simply the classical maximal coupling method in a functional guise. For each x ∈ Ω decompose m_x into βω and r_x := m_x − βω (which is a positive measure of mass 1 − β). Recall that we denote by μ0 the stationary measure of M. For all f ∈ L∞(Ω) we have:

L0 f(x) = β ω(f) + r_x(f)

L0 f(x) − L0 f(y) = r_x(f) − r_y(f)

|L0 f(x) − L0 f(y)| ≤ (1 − β) S(f)

since r_x and r_y are positive measures with the same total mass 1 − β, whence

S(L0 f) ≤ (1 − β) S(f).

We can thus apply Lemma 3.1 with C = 1 and θ = 1 − β, obtaining a spectral gap of size β/(2 − β).

Theorem 3.3. If M satisfies Doeblin's minorization condition (2) with ℓ = 1 and φ : Ω → [−1, 1], for all n ≥ 120/β and all a ≤ β/2 it holds

P_μ[ |μ̂_n(φ) − μ0(φ)| ≥ a ] ≤ 2.5 exp( − n a² · β / (150 + 47β) )

Proof. We have here ‖φ‖_S ≤ 2 and, by Lemma 3.2, δ0 ≥ β/(2 − β) ≥ β/2. It then suffices to apply Theorem A.
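As a sanity check on the contraction underlying Lemma 3.2 (our own sketch, not part of the paper), one can verify numerically that the spread S contracts by 1 − β under any finite-state kernel whose rows dominate β ω:

```python
import numpy as np

# Random finite-state kernel satisfying the minorization with l = 1:
# every row of M is beta*omega + (1-beta)*(a row of a stochastic Q).
rng = np.random.default_rng(1)
N, beta = 8, 0.3
omega = rng.random(N); omega /= omega.sum()
Q = rng.random((N, N)); Q /= Q.sum(axis=1, keepdims=True)
M = beta * omega[None, :] + (1 - beta) * Q

S = lambda f: f.max() - f.min()   # the spread semi-norm of Section 3.2
for _ in range(100):
    f = rng.standard_normal(N)
    # Lemma 3.2's contraction: S(L0 f) <= (1 - beta) S(f).
    assert S(M @ f) <= (1 - beta) * S(f) + 1e-12
```

The check works because the β ω(f) part of each row is a constant, which the spread S ignores, while the remaining (1 − β)-mass part maps f into (1 − β)[inf f, sup f].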

Runtime n to ensure error below a with probability ≥ 0.99

  Value of a   Theorem 6.3     Theorem B       [GO02] & [KLMM05]
  0.001        2.76 × 10^11    NA              1.18 × 10^12
  0.00003      3.07 × 10^14    8.3 × 10^12     1.3 × 10^15

Table 1: Comparison with [GO02, KLMM05] for ℓ = 1, β = 0.003, and observable φ : Ω → [0, 1].

The constant 150 in the rate is not so great, but we obtain a major improvement over [GO02, KLMM05] when 훽 is small (and 푎 in the Gaussian window): their rate is proportional to 훽2 instead of the correct order 훽 which we obtain. Let us give some concrete numerical estimates, sumed up in Table1. For 푎 = 0.001, 훽 = 0.003 and a 99% certainty, we need to take 푛 ≃ 2.76 × 1011 while [GO02, KLMM05] need 푛 ≃ 1.18 × 1012. For larger 푎, our exponential regime has a factor 훽2 but then we gain a factor of 푎. For very small 푎 we can appeal to TheoremB to obtain a much better rate. For this, 푘 푘 2 4 one only has to observe that 푆(L0휙¯) ≤ (1 − 훽) 푆(휙) to bound 휎 (휙) by any 푈 ≥ 1 + 훽 . The smaller is 푈, the best is the leading term in the rate (up to a best of −푛푎2훽/4) but the smaller is the allowed window (down to 푎 . 훽/72); for say 푎 = 0.00003 it suffices to take 푛 = 8.3 × 1012. As a matter of illustration, for 휙 :Ω → [−1, 1], 푎 = 0.00003 and 훽 = 0.003 Theorem B ensures that when 푛 ≃ 8.3 × 1012, the probability to get an error more than 푎 is less than 1/100. Meanwhile, [GO02, KLMM05] need at least 푛 ≃ 1.18 × 1015 so we gain two orders of magnitude, at the boundary of feasibility for cheaply simulated chains.
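As a cross-check of these figures (a sketch of ours), the required runtime can be recovered directly from the bound of Theorem 3.3: solving 2.5 exp(−na²β/(150 + 47β)) ≤ 0.01 for n reproduces the n ≃ 2.76 × 10^11 of Table 1's first row.

```python
import math

# Required n from Theorem 3.3's bound: the failure probability is at
# most 2.5 * exp(-n a^2 beta / (150 + 47 beta)), so demanding <= 0.01
# forces n >= ln(250) * (150 + 47 beta) / (a^2 beta).
a, beta = 0.001, 0.003
n = math.log(250) * (150 + 47 * beta) / (a * a * beta)
assert 2.7e11 < n < 2.8e11   # ~2.76e11, the order given in Table 1
```

Note that Theorem 3.3 requires a ≤ β/2 and n ≥ 120/β, both amply satisfied here.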

3.3 Discrete hypercube

Let us start with the same toy example as Joulin and Ollivier [JO10], the lazy random walk (a.k.a. Gibbs sampler, a.k.a. Glauber dynamics) on the discrete hypercube {0, 1}^N: the transition kernel M chooses a slot i ∈ {1, …, N} uniformly at random and replaces it with the result of a fair coin toss, i.e.
\[ m_x = \frac{1}{2}\delta_x + \frac{1}{2N}\sum_{y\sim x}\delta_y. \]
We consider two kinds of observables: the "polarization" ρ : {0, 1}^N → ℝ giving the proportion of 1's in its argument, and the characteristic function 1_S of a subset S ⊂ {0, 1}^N. In this second case, we will in particular consider the simple example
\[ S = [0] := \{(0, x_2, \dots, x_N) : x_i \in \{0,1\}\}. \]
We state in Table 2 the estimates only at the level of detail that allows comparison with Joulin and Ollivier's result. One sees that we obtain a weaker estimate in the case of ρ, but a better one in the case of 1_[0] if we are careful enough.

Runtime to ensure error below a with good probability

Observable                     Theorem A    Our best result    Joulin–Ollivier (Lip. norm)
1/N-Lip maps such as ρ         O(N/a²)      O(N/a²)            O(N + 1/a²)
1_[0]                          O(N²/a²)     O(N/a²)            O(N²/a²)
1_S where S is "scrambled"     O(N²/a²)     O(1/a²)            O(N²/a²)

Table 2: Comparison with [JO10], always assuming a small enough.

The rest of this subsection explains how to get these estimates from our results.
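To make the setting concrete, here is a minimal simulation of the lazy walk (our own illustration; the function names are ours) estimating the stationary mean of the two observables, both equal to 1/2 by symmetry of the uniform stationary measure.

```python
import random

def lazy_walk_step(x, rng):
    # Gibbs sampler on {0,1}^N: pick a uniform slot and replace it by a fair
    # coin toss; this realizes m_x = delta_x / 2 + sum_{y ~ x} delta_y / (2N).
    i = rng.randrange(len(x))
    x[i] = rng.randrange(2)
    return x

def estimate(phi, N, n, seed=0):
    # Empirical mean of phi along n steps of the chain started at 0...0.
    rng = random.Random(seed)
    x, total = [0] * N, 0.0
    for _ in range(n):
        x = lazy_walk_step(x, rng)
        total += phi(x)
    return total / n

N = 10
rho = lambda x: sum(x) / len(x)             # polarization
ind0 = lambda x: 1.0 if x[0] == 0 else 0.0  # indicator of S = [0]
r = estimate(rho, N, 100_000, seed=2)       # both stationary means equal 1/2
s = estimate(ind0, N, 100_000, seed=3)
```

Such a run makes the different mixing behaviors of ρ (which averages over all coordinates) and 1_[0] (which depends on a single slowly refreshed coordinate) visible in practice.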

3.3.1 Notation and functional spaces

The discrete hypercube {0, 1}^N is endowed with the Hamming metric: if x = (x₁, …, x_N) and y = (y₁, …, y_N), then d(x, y) is the number of indices i such that x_i ≠ y_i. Two elements at distance 1 are said to be adjacent, denoted by x ∼ y. We denote by E the set of tuples ε = (ε_i)_{1≤i≤N} such that exactly one of the ε_i is 1. Identifying {0, 1} with ℤ/2ℤ, an edge is thus written (x, x + ε) for some x ∈ {0, 1}^N and some ε ∈ E. We shall consider several function spaces to showcase the flexibility of the spectral method; since the space {0, 1}^N is finite, we always consider the space of all functions {0, 1}^N → ℝ, and only the choice of norm will matter. Let us define:

∙ ‖f‖_L = ‖f‖∞ + Lip(f): the standard Lipschitz norm;

∙ ‖f‖_{dL} = ‖f‖∞ + N Lip(f): the Lipschitz norm with the regularity part weighted by the diameter;

∙ ‖f‖_W = ‖f‖∞ + W(f), where
\[ W(f) = \sup_{x\in\{0,1\}^N}\ \sum_{\epsilon\in E} |f(x+\epsilon)-f(x)|; \]
this norm stays small for functions having large variations only in few directions (small "local total variation").

Proposition 3.4. Each of the norms ‖·‖_L, ‖·‖_{dL} and ‖·‖_W turns the space of all functions {0, 1}^N → ℝ into a Banach algebra where 1 has norm 1. Moreover, the averaging operator L₀ of the transition kernel M has operator norm 1 and a spectral gap with constant 1 and respective sizes 1/N², 1/(2N − 1) and 1/(4N − 1) in the norms ‖·‖_L, ‖·‖_{dL} and ‖·‖_W.

Proof. That the norms define Banach algebras is proven as indicated in Remark 2.5. All the other properties but the spectral gap are trivial.

To prove the spectral gaps, we simply apply Lemma 3.1. First, it is well known that for all φ : {0, 1}^N → ℝ, Lip(L₀φ) ≤ (1 − 1/N) Lip(φ) (in the parlance of [Oll09], M is positively curved with κ = 1/N). In the case of ‖·‖_L, we get θ = 1 − 1/N and C = N (since a function of vanishing average must take positive and negative values, and diam {0, 1}^N = N), hence a spectral gap of size 1/N². In the case of ‖·‖_{dL}, the normalizing factor gives C = 1 (and we still have θ = 1 − 1/N), hence a spectral gap of size 1/(2N − 1).

To deal with ‖·‖_W, we first show that in Lemma 3.1 we can take θ = 1 − 1/(2N):
\begin{align*}
W(\mathsf{L}_0\phi) &= \sup_x \sum_{\epsilon\in E}\Big|\frac{1}{2}\phi(x+\epsilon)+\frac{1}{2N}\sum_{\eta\in E}\phi(x+\eta+\epsilon)-\frac{1}{2}\phi(x)-\frac{1}{2N}\sum_{\eta\in E}\phi(x+\eta)\Big|\\
&= \sup_x \sum_{\epsilon\in E}\Big|\Big(\frac{1}{2}-\frac{1}{2N}\Big)\phi(x+\epsilon)+\frac{1}{2N}\sum_{\eta\neq\epsilon}\phi(x+\eta+\epsilon)-\Big(\frac{1}{2}-\frac{1}{2N}\Big)\phi(x)-\frac{1}{2N}\sum_{\eta\neq\epsilon}\phi(x+\eta)\Big|\\
&\leq \sup_x\ \frac{N-1}{2N}\sum_{\epsilon\in E}|\phi(x+\epsilon)-\phi(x)|+\frac{1}{2N}\sum_{\epsilon\in E}\sum_{\eta\neq\epsilon}|\phi(x+\epsilon+\eta)-\phi(x+\eta)|\\
&\leq \frac{N-1}{2N}W(\phi)+\frac{1}{2N}\sup_x\sum_{y\sim x}\sum_{\epsilon\in E}|\phi(y+\epsilon)-\phi(y)|\\
W(\mathsf{L}_0\phi) &\leq \Big(1-\frac{1}{2N}\Big)W(\phi).
\end{align*}
Then Lemma 3.5 below shows that we can take C = 1, providing a spectral gap of size 1/(4N − 1). The following optimal estimate and its proof were provided by Fedor Petrov on MathOverflow.

Lemma 3.5 (Fedor Petrov [Pet17]). For all 푓 : {0, 1}푁 → R we have max 푓 − min 푓 ≤ 푊 (푓).

Proof. Without loss of generality, we can assume W(f) ≤ 1 and f(0, 0, …, 0) = 0, and reduce to proving f(1, 1, …, 1) ≤ 1.

Define the cost of a path x⁰, x¹, …, x^k as the number Σ_{i=0}^{k−1} |f(x^{i+1}) − f(x^i)|, and let Σ be the sum of the costs of all paths of length N from (0, 0, …, 0) to (1, 1, …, 1). We shall prove that Σ ≤ N!, and since there are N! such paths, one of them will have cost at most 1, proving the lemma.

We call "level" of x ∈ {0, 1}^N the number of 1s among the coordinates of x, and denote it by |x|. For each i ∈ {0, 1, …, N}, define p_i = i!(N − i)!/(N + 1). Then all p_i are positive and p_i + p_{i+1} = i!(N − i − 1)! is precisely the number of paths that use any given edge from level i to level i + 1.

The contribution to Σ of an edge (x, x + ε) from level i to level i + 1 is thus i!(N − i − 1)! |f(x + ε) − f(x)|, which we split into two parts: one, p_i |f(x + ε) − f(x)|, attributed to x, and the other, p_{i+1} |f(x + ε) − f(x)|, to x + ε. It follows that

\[ \Sigma \leq \sum_{x\in\{0,1\}^N} p_{|x|}\,W(f) \leq \sum_{i=0}^{N}\binom{N}{i}p_i = \sum_{i=0}^{N-1}\binom{N-1}{i}(p_i+p_{i+1}) = N\,(N-1)! = N! \]
as desired.
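Lemma 3.5 is easy to verify by brute force on small hypercubes; the following check (ours) computes W(f) directly from its definition and tests the inequality on random functions.

```python
from itertools import product
import random

def flip(x, i):
    # x with coordinate i flipped
    return x[:i] + (1 - x[i],) + x[i+1:]

def W(f, N):
    # Local total variation: sup over x of the sum over all N directions
    # of |f(x + e_i) - f(x)|.
    return max(sum(abs(f[flip(x, i)] - f[x]) for i in range(N))
               for x in product((0, 1), repeat=N))

rng = random.Random(0)
for N in (2, 3, 4):
    for _ in range(100):
        f = {x: rng.random() for x in product((0, 1), repeat=N)}
        assert max(f.values()) - min(f.values()) <= W(f, N) + 1e-12
```

For f(x) = |x| (the level function), every flip changes f by exactly 1, so W(f) = N and the bound max f − min f ≤ W(f) is attained.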

3.3.2 Polarization

Consider the "polarization" observable ρ : {0, 1}^N → ℝ, where ρ(x) is the proportion of 1's in the word x. We have
\[ \|\rho\|_L = 1+\frac{1}{N}, \qquad \|\rho\|_{dL} = 2, \qquad \|\rho\|_W = 2. \]
To use Theorem A with optimal efficiency, assuming a is small enough, we need to maximize δ₀/‖ρ‖²; we shall thus use the norm ‖·‖_{dL}. For a ≲ 1/N, Theorem A shows that we need at most O(N/a²) iterations to have good convergence to the actual mean; meanwhile, Joulin and Ollivier only need O(1/a²), but for concentration around the expectation of the empirical process, not around the expectation with respect to the stationary measure. Without burn-in, one also needs to bound the bias, which approaches zero in time O(N/a) according to the bound of Joulin and Ollivier, for a total run time of O(N/a + 1/a²). With burn-in, they need a run time of O(N + 1/a²). For 1/N ≲ a ≲ 1, we enter our exponential regime while staying inside Joulin–Ollivier's Gaussian window; Theorem A shows we need no more than O(N²/a) iterations, while [JO10] still gives a bound of O(N + 1/a²). In this example, Joulin and Ollivier get a sharper result; this seems to be explained partly by the fact that we do not get to decouple the bias from the convergence of expectations, and partly by our need for a Banach algebra, hence to include the uniform norm in our norm.

3.3.3 Observable with small variance or small local total variation

Consider now the potential 1_S, the indicator function of a (non-trivial) set S. This function is only 1-Lipschitz, so that we have ‖1_S‖_L = 2 and ‖1_S‖_{dL} = 1 + N. If we insist on using a Lipschitz norm, the unnormalized one is thus better, and with δ₀ = 1/N² Theorem A shows that we need (in the Gaussian regime) O(N²/a²) iterations to ensure the error is probably less than a, which is the same order of magnitude as given by [JO10], with a worse constant (∼34 instead of 8). But here we have two ways to improve on this bound. The first one is to use Theorem B. When

푁 푆 = [0] := {0푥2푥3 ··· 푥푁 ∈ {0, 1} },

the dynamical variance can be computed explicitly (distinguish the cases where the first digit has been changed an odd or an even number of times, and observe that at each step the probability of changing the first digit is 1/(2N)):
\[ \mu_0(1_{[0]}^2)-(\mu_0 1_{[0]})^2 = \frac{1}{4} \quad\text{and}\quad \sum_{k\geq1}\mu_0\big(1_{[0]}\,\mathsf{L}_0^k\bar{1}_{[0]}\big) = \frac{1}{4}\sum_{k\geq1}\Big(\frac{N-1}{N}\Big)^k = \frac{N-1}{4}. \]

This gives σ²(1_S) ≃ N/2. Switching back to the norm ‖·‖_{dL}, when a ≲ 1/N (see Remark 2.8) and n ≥ 60N², the positive term in the exponential of Theorem B is negligible compared to the main term, which is −na²/N. In particular, O(N/a²) iterations suffice to get a small probability for a deviation of at least a: compared to Joulin and Ollivier, we gain one power of N in this regime (and the optimal constant 1 in the leading term of the rate), but only for very small values of a.³ This choice of S might seem very specific, but for less regular S the gain should be greater for sufficiently smaller a. For example, if S contains half the vertices and every vertex x ∈ {0, 1}^N has exactly 2Np neighbors with the same 1_S value, the above computation of variance gives σ²(1_S) = 1/4 + (1 − 2p)/(4p). "Scrambled" sets with p independent of N get σ²(1_S) ≃ 1, and taking n = O(1/a²) is sufficient.

The second way to improve our first estimate is to use the norm ‖·‖_W in Theorem A. Then ‖1_[0]‖_W = 2 and δ₀ ≃ 1/N. For a ≲ 1/N, Theorem A ensures that we need only O(N/a²) iterations to have good convergence to the actual mean, which is again the optimal order of magnitude (since it corresponds to the CLT), but obtained on a much larger window than with Theorem B. This extends to all observables with W(φ) ≲ 1; observe that this domain of applicability is quite complementary to that of the previous paragraph.
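The series above is elementary to check numerically; this small computation (ours) sums the autocovariance series and recovers σ²(1_[0]) = 1/4 + 2 · (N − 1)/4 = (2N − 1)/4 ≃ N/2.

```python
def dynamical_variance(N, terms=10_000):
    # sigma^2(1_[0]) = mu_0(1^2) - (mu_0 1)^2 + 2 * sum_{k>=1} mu_0(1 L^k 1bar)
    #               = 1/4 + 2 * (1/4) * sum_{k>=1} ((N - 1)/N)^k
    tail = sum(0.25 * ((N - 1) / N) ** k for k in range(1, terms))
    return 0.25 + 2 * tail

# For N = 100 this is (2N - 1)/4 = 49.75 up to a tiny geometric truncation tail.
print(dynamical_variance(100))
```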

3.4 Bernoulli convolutions and observables of bounded variation

The MCMC method is often used in high dimension, where it can be very difficult to simulate independent random variables of a given law. Let us give an example showing that even in dimension 1, using Markov chains can be efficient. We consider the "Bernoulli convolution" of parameter λ ∈ (0, 1), defined as the law β_λ of the random variable
\[ \sum_{k\geq1}\epsilon_k\lambda^k \]
where the ε_k are independent Bernoulli variables of parameter 1/2, i.e. ε_k is 1 with probability 1/2 and −1 with probability 1/2.

When λ < 1/2, the support of β_λ is a Cantor set of zero Lebesgue measure, so that β_λ is singular⁴. When λ = 1/2, β_λ is the uniform measure on [−1, 1]. But when λ ∈ (1/2, 1)

³ If we want to consider a of the order of 1/N, we can then take U ≃ N² to enlarge the window, at the cost of a weaker leading term. We get a bound similar to that of Joulin–Ollivier, possibly with a smaller constant (depending on the value of a).
⁴ Unless explicitly mentioned, in this subsection "singular" and "absolutely continuous" are always meant with respect to Lebesgue measure.

Figure 1: Histogram of the empirical distribution of the Markov chain associated to (T₀, T₁), with X₀ = 0, binned into 500 subintervals (image averaged over 30 independent runs of 10⁶ points each): (a) λ = (√5 − 1)/2 ≃ 0.618, (b) λ = e/4 ≃ 0.680, (c) λ = 2/3 ≃ 0.667. The parameter λ is the inverse of a Pisot number on the left, a very well approximable irrational in the center, rational on the right.

(which we assume from now on), the question of the absolute continuity of β_λ is very difficult, and fascinating. It was proved by Erdős [Erd39] that if λ is the inverse of a Pisot number, then β_λ is singular, and a while later Solomyak discovered that for Lebesgue-almost all λ, β_λ is absolutely continuous [Sol95]; see Figure 1 for an illustration. Important open questions are to find an explicit λ such that β_λ is absolutely continuous, and to find an explicit λ, not the inverse of a Pisot number, such that β_λ is singular; see [PSS00] for more information on these questions. Finding an expression for the density of β_λ when it exists seems of course even further from reach; but of course, there is an obvious way to simulate β_λ: draw the ε_k up to some rank K. To achieve a precision of p on the result, one needs K ≃ log(1/p)/log(1/λ) random bits for each independent sample. One can also realize β_λ naturally as the stationary law of the Markov transition kernel

M = (m_x)_{x∈ℝ} defined by
\[ m_x = \frac{1}{2}\delta_{T_0(x)}+\frac{1}{2}\delta_{T_1(x)} \]
where T₀(x) = λx − λ and T₁(x) = λx + λ (this is a particular case of an Iterated Function System (IFS)).

In order to evaluate β_λ(φ) by an MCMC method, one cannot use the methods developed for ergodic Markov chains since, conditionally on X₀ = x, the law m_x^k of X_k is atomic and thus singular with respect to β_λ: d_TV(m_x^k, β_λ) = 1 for all k. The convergence only holds for observables satisfying some regularity assumption, and it is natural to ask what regularity is needed. For a Lipschitz observable φ, one only needs to observe that M has positive curvature in the sense of Ollivier (this is easy using the coupling ½δ_{(T₀(x),T₀(y))} + ½δ_{(T₁(x),T₁(y))} of m_x and m_y) and apply [JO10]. But what if φ is not Lipschitz (or has a large Lipschitz constant)? We shall consider observables of bounded variation, a regularity which has the great advantage over Lipschitz of including the characteristic functions of intervals.
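Simulating this chain costs one random bit per step, with no truncation rank K to choose. A minimal sketch (our own illustration; β_λ is only approximated through the empirical distribution, and the names are ours):

```python
import random

def bernoulli_conv_samples(lam, n, seed=0):
    # Markov chain X_{k+1} = T_0(X_k) or T_1(X_k) with probability 1/2 each,
    # i.e. X_{k+1} = lam * X_k -/+ lam; its stationary law is beta_lam.
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = lam * x + (lam if rng.random() < 0.5 else -lam)
        out.append(x)
    return out

# Estimate beta_lam(1_J) for J = (0, lam/(1-lam)]; since the eps_k are
# symmetric, beta_lam is symmetric around 0 and this mass equals 1/2.
lam = 0.6
xs = bernoulli_conv_samples(lam, 200_000, seed=4)
est = sum(1 for x in xs if x > 0) / len(xs)
```

The chain stays inside the attractor I_λ = [−λ/(1 − λ), λ/(1 − λ)], so no range check on the samples is needed.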

Definition 3.6. Given an interval I ⊂ ℝ, we consider the space BV(I) of bounded-variation functions I → ℝ, defined by the norm ‖·‖_BV = ‖·‖∞ + var(·, I), where
\[ \operatorname{var}(f, I) := \sup_{x_0<x_1<\dots<x_p\in I}\ \sum_{j=1}^{p}|f(x_j)-f(x_{j-1})| \]
(the uniform norm is usually replaced by the L¹ norm, but when I is bounded our choice is equivalent up to a constant, does not single out the Lebesgue measure, and, most importantly, ensures that BV(I) is a Banach algebra).

Important features of total variation are:

∙ its extensiveness: var(푓, 퐼) ≥ var(푓, 퐽) + var(푓, 퐾) whenever 퐽, 퐾 are disjoint subintervals of 퐼,

∙ its invariance under monotonic maps: var(푓 ∘ 푇, 퐼) = var(푓, 푇 (퐼)) whenever 푇 is monotonic.

The averaging operator L₀ of the transition kernel M has a spectral gap for all λ, but not with constant 1 when λ > 1/2. This only means that we will deal with an extracted Markov chain (X_{ℓk})_{k≥0} for some ℓ. Let I_λ be the attractor of the IFS (T₀, T₁), i.e. the interval whose endpoints are the fixed points of T₀ and T₁:
\[ I_\lambda = \Big[\frac{-\lambda}{1-\lambda}, \frac{\lambda}{1-\lambda}\Big]. \]

Given a word 휔 = 휔1휔2 . . . 휔푘 in the letters 0 and 1, we define

푇휔 = 푇휔1 ∘ 푇휔2 ∘ · · · ∘ 푇휔푘 : 퐼휆 → 퐼휆.

Proposition 3.7. If λ^ℓ < 1/2, then L₀^ℓ has a spectral gap on BV(I_λ) of size 1/(2^{ℓ+1} − 1) and constant 1.

Proof. Let I_λ⁻, I_λ⁺ be the left and right halves of I_λ, i.e.
\[ I_\lambda^- = \Big[\frac{-\lambda}{1-\lambda}, 0\Big), \qquad I_\lambda^+ = \Big(0, \frac{\lambda}{1-\lambda}\Big]. \]

Let f ∈ BV(I_λ) and observe that the condition λ^ℓ < 1/2 ensures that T_{00…0}(I_λ) and T_{11…1}(I_λ) are disjoint (they have length < |I_λ|/2 and each contains an endpoint of I_λ). Then:
\begin{align*}
\operatorname{var}(\mathsf{L}_0^\ell f, I_\lambda) &\leq \frac{1}{2^\ell}\sum_{\omega\in\{0,1\}^\ell}\operatorname{var}(f\circ T_\omega, I_\lambda) \leq \frac{1}{2^\ell}\sum_{\omega\in\{0,1\}^\ell}\operatorname{var}(f, T_\omega(I_\lambda))\\
&\leq \frac{1}{2^\ell}\Big(\operatorname{var}\big(f, T_{00\ldots0}(I_\lambda)\big)+\operatorname{var}\big(f, T_{11\ldots1}(I_\lambda)\big)+\sum_{\omega\neq00\ldots0,\,11\ldots1}\operatorname{var}(f, I_\lambda)\Big)\\
&\leq \frac{1}{2^\ell}\big(\operatorname{var}(f, I_\lambda)+(2^\ell-2)\operatorname{var}(f, I_\lambda)\big) = (1-2^{-\ell})\operatorname{var}(f, I_\lambda).
\end{align*}

Applying Lemma 3.1 with 퐶 = 1 and 휃 = 1 − 2−ℓ yields the claim.

This enables us to apply our result to estimate 훽휆(휙) for any 휙 of bounded variation. For example, TheoremA yields the following (where we only state the Gaussian regime, with the slightly better constant of Theorem 6.3).

Theorem 3.8. Let λ ∈ (1/2, 1) and let ℓ be the smallest integer such that λ^ℓ < 1/2. Consider a Markov chain (X_k)_{k≥0} with transition probability 2^{−ℓ} from x ∈ I_λ to T_ω(x), for each ω ∈ {0, 1}^ℓ. For any starting distribution X₀ ∼ μ, any φ ∈ BV(I_λ), any positive a < ‖φ‖_BV/(3(2^{ℓ+1} − 1)) and any n ≥ 120 · 2^ℓ we have
\[ \mathbb{P}_\mu\big[|\hat\mu_n(\phi)-\mu_0(\phi)|\geq a\big] \leq 2.488\exp\Big(-\frac{na^2}{\|\phi\|_{\mathrm{BV}}^2\,(16.65\cdot2^\ell+5.12)}\Big). \]
To the best of our knowledge, this example could not be handled effectively by previously known results. For example, [GD12] needs the observable to be at least C² to have explicit estimates, and they do not give a concentration inequality.

For the sake of concreteness, fix any λ ∈ (1/2, 1/√2), so that the above applies with ℓ = 2. Consider as observable a characteristic function 1_J where J ⊂ I_λ is an (open, say) interval. We have ‖1_J‖_BV = 3. The Gaussian window is very large: for all a ∈ (0, 1/7), to ensure that an error of at least a occurs with probability less than 1/100 it is sufficient to take n = 3561/a², i.e. 7122/a² random bits. The constant is somewhat large, and could probably be improved by taking a larger ℓ and finding more disjoint pairs of intervals in the proof of the spectral gap, enabling one to get a much smaller θ in Lemma 3.1.

Of course, to estimate β_λ(1_J) one could be tempted to use Hoeffding's inequality (we would need only n = 2.65/a² independent samples); but for any given random point Y with distribution β_λ, it is very difficult to determine a priori which precision will ensure a correct value for 1_J(Y), and thus how many ε_k should be drawn. One can easily construct a stopping time (waiting for the distance between the current value and the boundary of the interval to be larger than Σ_{k>K} λ^k), but since β_λ might be a very irregular measure, it seems quite difficult to control this stopping time a priori. One could also be tempted to bound 1_J from below and above by Lipschitz functions in order to apply [JO10], but one would need to ensure that these bounding functions are L¹(β_λ)-close to one another. For this one would need them to have very large Lipschitz constants (of the order of 1/a if one very conservatively assumes that β_λ is absolutely continuous with bounded density), making the total runtime of the order of 1/a⁴ in the best case.

4 Connection with perturbation theory

To any φ ∈ X (sometimes called a "potential" in this role) is associated a weighted averaging operator, called a transfer operator in the dynamical context:
\[ \mathsf{L}_\phi f(x) = \int_\Omega e^{\phi(y)} f(y)\,\mathrm{d}m_x(y). \]
The classical guiding idea for the present work combines two observations. First, we have
\[ \mathsf{L}_\phi^2 f(x_0) = \int_\Omega e^{\phi(x_1)}\mathsf{L}_\phi f(x_1)\,\mathrm{d}m_{x_0}(x_1) = \int_{\Omega\times\Omega} e^{\phi(x_1)}e^{\phi(x_2)} f(x_2)\,\mathrm{d}m_{x_1}(x_2)\,\mathrm{d}m_{x_0}(x_1) \]
and, by a direct induction, denoting by dm^n_{x₀}(x₁, …, x_n) the law of n steps of a Markov chain following the transition kernel M and starting at x₀, we have
\[ \mathsf{L}_\phi^n f(x_0) = \int_{\Omega^n} e^{\phi(x_1)+\dots+\phi(x_n)} f(x_n)\,\mathrm{d}m^n_{x_0}(x_1,\dots,x_n). \]
In particular, applying this to the function f = 1, we get
\[ \mathsf{L}_\phi^n 1(x_0) = \int_{\Omega^n} e^{\phi(x_1)+\dots+\phi(x_n)}\,\mathrm{d}m^n_{x_0}(x_1,\dots,x_n) = \mathbb{E}_{x_0}\big[e^{\phi(X_1)+\dots+\phi(X_n)}\big], \]
where (X_k)_{k≥0} is a Markov chain with transitions M and the subscript on expectations and probabilities specifies the initial distribution (x₀ being short for δ_{x₀}). It follows by linearity that if the Markov chain is started with X₀ ∼ μ, where μ is any probability measure, then setting μ̂_nφ := (1/n)φ(X₁) + ⋯ + (1/n)φ(X_n), we have
\[ \mathbb{E}_\mu\big[\exp(t\hat\mu_n\phi)\big] = \int \mathsf{L}^n_{\frac{t}{n}\phi}1(x)\,\mathrm{d}\mu(x). \tag{3} \]

This makes a strong connection between the transfer operators and the behavior of μ̂_nφ. Second, when the potential is small (e.g. (t/n)φ with n large), the transfer operator is a perturbation of L₀, and their spectral properties will be closely related. This is the part that has to be made quantitative to obtain effective limit theorems.

We will state the perturbation results we need after introducing some notation. The letter L will always denote a bounded linear operator, and ‖·‖ will be used both for the norm in X and for the operator norm. From now on it is assumed that L₀ has a spectral gap of size δ₀ and constant 1. In [Klo17b] the leading eigenvalue of L₀ is denoted by λ₀, an eigenvector is denoted by u₀, and an eigenform (an eigenvector of L₀*) is denoted by φ₀ (similarly, the eigenvalue of an operator L close to L₀ is denoted by λ_L). Two quantities appear in the perturbation results below. The first one is the condition number τ₀ := ‖φ₀‖‖u₀‖/|φ₀(u₀)|. To define the second one, we need to introduce π₀, the projection on G₀ along ⟨u₀⟩, which here writes π₀(f) = f − μ₀(f), and observe that by the spectral gap hypothesis (L₀ − λ₀) is invertible when acting on G₀. Then the spectral isolation is defined as
\[ \gamma_0 := \big\|(\mathsf{L}_0-\lambda_0)^{-1}_{|G_0}\,\pi_0\big\|. \]

We shall denote by P₀ the projection on ⟨u₀⟩ along G₀, and set R₀ = L₀ ∘ π₀. We then have the expression L₀ = λ₀P₀ + R₀ with P₀R₀ = R₀P₀ = 0. This decomposition will play a role below, and can be done for all L with a spectral gap: we denote by λ_L, π_L, P_L, R_L the corresponding objects for L, and by λ, π, P, R the corresponding maps L ↦ λ_L, etc. Last, the notation O_C(·) is the Landau notation with an explicit constant C, i.e. f(x) = O_C(g(x)) means that for all x, |f(x)| ≤ C|g(x)|.

Theorem 4.1 (Theorems 2.3 and 2.6 and Proposition 5.1 (viii) of [Klo17b]). All L such that ‖L − L₀‖ < 1/(6τ₀γ₀) have a simple isolated eigenvalue; λ, π, P, R are defined and analytic on this ball. Given any K > 1, whenever ‖L − L₀‖ ≤ (K − 1)/(6Kτ₀γ₀) we have
\begin{align*}
\lambda_{\mathsf{L}} &= \lambda_0 + O_{\tau_0+\frac{K-1}{3}}(\|\mathsf{L}-\mathsf{L}_0\|)\\
\lambda_{\mathsf{L}} &= \lambda_0 + \varphi_0(\mathsf{L}-\mathsf{L}_0)u_0 + O_{K\tau_0\gamma_0}(\|\mathsf{L}-\mathsf{L}_0\|^2)\\
\lambda_{\mathsf{L}} &= \lambda_0 + \varphi_0(\mathsf{L}-\mathsf{L}_0)u_0 + \varphi_0(\mathsf{L}-\mathsf{L}_0)\mathsf{S}_0(\mathsf{L}-\mathsf{L}_0)u_0 + O_{2K^2\tau_0^2\gamma_0^2}(\|\mathsf{L}-\mathsf{L}_0\|^3)\\
\mathsf{P}_{\mathsf{L}} &= \mathsf{P}_0 + O_{2K\tau_0\gamma_0}(\|\mathsf{L}-\mathsf{L}_0\|)\\
\pi_{\mathsf{L}} &= \pi_0 + O_{\tau_0+\frac{K-1}{3}}(\|\mathsf{L}-\mathsf{L}_0\|)\\
\Big\|D\Big[\frac{1}{\lambda}\mathsf{R}\Big]_{\mathsf{L}}\Big\| &\leq \frac{1}{|\lambda_{\mathsf{L}}|} + \frac{\tau_0+\frac{K-1}{3}}{|\lambda_{\mathsf{L}}|^2}\,\|\mathsf{L}\| + 2K\tau_0\gamma_0.
\end{align*}

Theorem 4.2 (Corollary 2.12 of [Klo17b]). In the case λ₀ = ‖L₀‖ = 1, all L such that
\[ \|\mathsf{L}-\mathsf{L}_0\| \leq \frac{\delta_0(\delta_0-\delta)}{6(1+\delta_0-\delta)\tau_0\|\pi_0\|} \]
have a spectral gap of size δ below λ_L, with constant 1.

Since we will apply these results to the averaging operator L₀, we need to evaluate the parameters in this case. We have λ₀ = 1, u₀ = 1, and φ₀ is identified with the stationary measure μ₀. It first follows that τ₀ = 1: indeed, ‖u₀‖ = 1 by hypothesis, ‖φ₀‖ = 1 since ‖·‖ ≥ ‖·‖∞ and φ₀ is a probability measure, and |φ₀(u₀)| = |μ₀(1)| = 1. Then we have ‖π₀‖ ≤ 2, since for all f ∈ X we have π₀(f) = f − μ₀(f) and ‖μ₀(f)1‖ = |μ₀(f)| ≤ ‖f‖∞ ≤ ‖f‖. In general this trivial bound can hardly be improved without more information, notably on μ₀: it may be the case that μ₀ is concentrated on a specific region of the space, and then f − μ₀(f) could have norm close to twice the norm of f. Last, from the expansion (1 − L₀)^{-1}_{|G₀}π₀ = Σ_{k≥0} L₀^k π₀, the spectral gap δ₀, and the upper bound on ‖π₀‖, we deduce γ₀ ≤ 2/δ₀.

5 Main estimates

Standing assumption 2.3 ensures that for all small enough φ we can apply the above perturbation results; recall that μ₀ is the stationary measure, so that for all f ∈ X we have ∫ L₀f dμ₀ = ∫ f dμ₀.

We will first apply Theorem 4.2 with δ = δ₀/13; this is somewhat arbitrary, but the exponential decay will be strong enough compared to the other quantities that we do not need δ to be large, and taking it quite small allows for a larger radius where the result applies. As a consequence of this choice, the following smallness assumption will often be needed:
\[ \|\phi\| \leq \log\Big(1+\frac{\delta_0^2}{13+12\delta_0}\Big). \tag{4} \]

We will often use φ instead of L_φ in subscripts: for example, λ_φ is the main eigenvalue of L_φ and π_φ is the linear projection on its eigendirection along the stable complement appearing in the definition of the spectral gap.

Lemma 5.1. We have
\[ \mathsf{L}_\phi(\cdot) = \mathsf{L}_0\Big(\sum_{j\geq0}\frac{\phi^j}{j!}\,\cdot\Big) \quad\text{and}\quad \|\mathsf{L}_\phi-\mathsf{L}_0\| \leq e^{\|\phi\|}-1. \]
If (4) holds, then we have
\begin{align*}
\|\mathsf{L}_\phi-\mathsf{L}_0\| &\leq \frac{\delta_0^2}{13+12\delta_0} \leq \frac{1}{25}\\
\mathsf{L}_\phi &= \mathsf{L}_0 + O_{1.02}(\|\phi\|)\\
&= \mathsf{L}_0 + \mathsf{L}_0(\phi\,\cdot) + O_{0.507}(\|\phi\|^2)\\
&= \mathsf{L}_0\big((1+\phi+\tfrac{1}{2}\phi^2)\,\cdot\big) + O_{0.169}(\|\phi\|^3),\\
\|\pi_\phi\| &\leq 2.053.
\end{align*}

Assumption (4) is in particular sufficient to apply Theorem 4.2 with 훿 = 훿0/13 and Theorem 4.1 with 퐾 = 1 + 12훿0/13.

Proof. The first formula is a rephrasing of the definition of L_φ; observe then that, thanks to the assumption that X is a Banach algebra, we have
\[ \|\mathsf{L}_\phi-\mathsf{L}_0\| = \big\|\mathsf{L}_0\big((e^\phi-1)\,\cdot\big)\big\| \leq \|\mathsf{L}_0\|\,\Big\|\sum_{j=1}^\infty\frac{\phi^j}{j!}\Big\| \leq \sum_{j=1}^\infty\frac{\|\phi\|^j}{j!} = e^{\|\phi\|}-1. \]
Observing that x ↦ x²/(13 + 12x) increases from 0 to 1/25 as x varies from 0 to 1 completes the uniform bound on ‖L_φ − L₀‖ and gives ‖φ‖ ≤ log(1 + 1/25) =: b. By convexity, we deduce that
\[ e^{\|\phi\|}-1 \leq \frac{e^b-1}{b}\,\|\phi\| \leq 1.02\,\|\phi\| \]
and the zeroth-order Taylor formula follows. The higher-order estimates are obtained similarly:
\[ \mathsf{L}_\phi = \mathsf{L}_0\big((1+\phi+(e^\phi-\phi-1))\,\cdot\big) = \mathsf{L}_0 + \mathsf{L}_0(\phi\,\cdot) + O_{\|\mathsf{L}_0\|}\big(\|e^\phi-\phi-1\|\big), \]
and, using the triangle inequality, the convexity of (e^x − x − 1)/x, and the bound on ‖φ‖:
\[ \|e^\phi-\phi-1\| \leq \frac{e^{\|\phi\|}-\|\phi\|-1}{\|\phi\|}\,\|\phi\| \leq \frac{e^b-b-1}{b^2}\,\|\phi\|^2 \leq 0.507\,\|\phi\|^2. \]
The second-order remainder is bounded by
\[ \big\|e^\phi-\tfrac{1}{2}\phi^2-\phi-1\big\| \leq \frac{e^b-\tfrac{1}{2}b^2-b-1}{b^3}\,\|\phi\|^3 \leq 0.169\,\|\phi\|^3, \]
and finally we have
\[ \|\pi_\phi\| \leq \|\pi_0\| + \Big(1+\frac{4\delta_0}{13}\Big)\|\mathsf{L}_\phi-\mathsf{L}_0\| \leq 2+\Big(1+\frac{4}{13}\Big)\frac{1}{25} \leq 2.053. \]

Lemma 5.2. Under (4) we have
\begin{align*}
|\lambda_\phi-1| &\leq 0.0524\\
\lambda_\phi &= 1+O_{1.334}(\|\phi\|)\\
\lambda_\phi &= 1+\mu_0(\phi)+O_{2.43+2.081\delta_0^{-1}}(\|\phi\|^2)\\
\lambda_\phi &= 1+\mu_0(\phi)+\frac{1}{2}\mu_0(\phi^2)+\sum_{k\geq1}\mu_0\big(\phi\,\mathsf{L}_0^k(\bar\phi)\big)+O_{7.41+17.75\delta_0^{-1}+8.49\delta_0^{-2}}(\|\phi\|^3)
\end{align*}

Proof. With K = 1 + 12δ₀/13 we have τ₀ + (K − 1)/3 = 1 + 4δ₀/13 and, by Theorem 4.1, L ↦ λ_L has Lipschitz constant at most 1 + 4/13 = 17/13. We get |λ_φ − λ₀| ≤ (17/13)‖L_φ − L₀‖, from which we deduce both
\[ |\lambda_\phi-1| \leq \frac{17}{13\times25} \leq 0.0524 \quad\text{and}\quad |\lambda_\phi-1| \leq \frac{17}{13}\times1.02\,\|\phi\| \leq 1.334\,\|\phi\|. \]
Now we use the first-order Taylor formula for λ, using Kτ₀γ₀ ≤ 2δ₀^{-1}(1 + 12δ₀/13) = 24/13 + 2δ₀^{-1}:
\[ \lambda_\phi = 1+\mu_0(\mathsf{L}_\phi1-\mathsf{L}_01)+O_{\frac{24}{13}+2\delta_0^{-1}}(\|\mathsf{L}_\phi-\mathsf{L}_0\|^2), \]
then, using L_φ1 − L₀1 = L₀(φ) + O_{0.507}(‖φ‖²) from Lemma 5.1, we get
\[ \mu_0(\mathsf{L}_\phi1-\mathsf{L}_01) = \mu_0\big(\mathsf{L}_0(\phi)\big)+O_{0.507}(\|\phi\|^2) = \mu_0(\phi)+O_{0.507}(\|\phi\|^2). \]
Using ‖L_φ − L₀‖ ≤ 1.02‖φ‖ gives the following constant in the final O(‖φ‖²) of the first-order formula:
\[ 0.507+(1.02)^2\Big(\frac{24}{13}+2\delta_0^{-1}\Big) \leq 2.43+2.081\,\delta_0^{-1}. \]
Then we apply the second-order Taylor formula:

\[ \lambda_\phi = 1+\mu_0(\mathsf{L}_\phi1-\mathsf{L}_01)+\mu_0\big((\mathsf{L}_\phi-\mathsf{L}_0)\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01)\big)+O_{8K^2\delta_0^{-2}}(\|\mathsf{L}_\phi-\mathsf{L}_0\|^3). \]
Using L_φ1 − L₀1 = L₀(φ + ½φ²) + O_{0.169}(‖φ‖³) from Lemma 5.1, we first get
\[ \mu_0(\mathsf{L}_\phi1-\mathsf{L}_01) = \mu_0(\phi)+\frac{1}{2}\mu_0(\phi^2)+O_{0.169}(\|\phi\|^3). \]
To simplify the second term, we recall that L_φ − L₀ = L₀(φ·) + O_{0.507}(‖φ‖²) and
\[ \mathsf{S}_0 = (1-\mathsf{L}_0)^{-1}_{|G_0}\pi_0 = \Big(\sum_{k\geq0}\mathsf{L}_0^k\Big)\pi_0, \]
where π₀ is the projection on ker μ₀ along ⟨1⟩, i.e. π₀(f) = f − μ₀(f) =: f̄, and has norm at most 2. We thus have (noticing that in the second line both the main term and the remainder belong to ker μ₀):
\begin{align*}
\pi_0(\mathsf{L}_\phi1-\mathsf{L}_01) &= \pi_0\big(\mathsf{L}_0(\phi)+O_{0.507}(\|\phi\|^2)\big) = \mathsf{L}_0(\bar\phi)+O_{1.014}(\|\phi\|^2)\\
\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01) &= \sum_{k\geq1}\mathsf{L}_0^k(\bar\phi)+O_{1.014\,\delta_0^{-1}}(\|\phi\|^2).
\end{align*}
We also have
\[ \|\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01)\| \leq \frac{2}{\delta_0}\|\mathsf{L}_\phi1-\mathsf{L}_01\| \leq \frac{2.04}{\delta_0}\|\phi\|. \]
It follows that

\begin{align*}
(\mathsf{L}_\phi-\mathsf{L}_0)\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01) &= \mathsf{L}_0\Big(\phi\sum_{k\geq1}\mathsf{L}_0^k(\bar\phi)\Big)+O_{1.014\,\delta_0^{-1}}\big(\|\mathsf{L}_\phi-\mathsf{L}_0\|\,\|\phi\|^2\big)+O_{0.507}\big(\|\phi\|^2\,\|\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01)\|\big)\\
&= \mathsf{L}_0\Big(\phi\sum_{k\geq1}\mathsf{L}_0^k(\bar\phi)\Big)+O_{2.07\,\delta_0^{-1}}(\|\phi\|^3)\\
\mu_0\big((\mathsf{L}_\phi-\mathsf{L}_0)\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01)\big) &= \sum_{k\geq1}\mu_0\big(\phi\,\mathsf{L}_0^k(\bar\phi)\big)+O_{2.07\,\delta_0^{-1}}(\|\phi\|^3),
\end{align*}
where the exchange of sum and integral is justified by normal convergence. Last, we observe
\[ 8K^2\delta_0^{-2} = 8\Big(\frac{12}{13}+\delta_0^{-1}\Big)^2 \leq 6.82+14.77\,\delta_0^{-1}+8\,\delta_0^{-2}, \]
and we gather all that precedes:
\begin{align*}
\lambda_\phi &= 1+\mu_0(\mathsf{L}_\phi1-\mathsf{L}_01)+\mu_0\big((\mathsf{L}_\phi-\mathsf{L}_0)\mathsf{S}_0(\mathsf{L}_\phi1-\mathsf{L}_01)\big)+O_{8K^2\delta_0^{-2}}(\|\mathsf{L}_\phi-\mathsf{L}_0\|^3)\\
&= 1+\mu_0(\phi)+\frac{1}{2}\mu_0(\phi^2)+\sum_{k\geq1}\mu_0\big(\phi\,\mathsf{L}_0^k(\bar\phi)\big)+O_{0.169}(\|\phi\|^3)+O_{2.07\,\delta_0^{-1}}(\|\phi\|^3)+O_{(6.82+14.77\delta_0^{-1}+8\delta_0^{-2})\,1.02^3}(\|\phi\|^3)\\
&= 1+\mu_0(\phi)+\frac{1}{2}\mu_0(\phi^2)+\sum_{k\geq1}\mu_0\big(\phi\,\mathsf{L}_0^k(\bar\phi)\big)+O_{7.41+17.75\delta_0^{-1}+8.49\delta_0^{-2}}(\|\phi\|^3).
\end{align*}

Under assumption (4), we know that L_φ has a spectral gap of size δ₀/13 with constant 1, and we can write L_φ = λ_φP_φ + R_φ, where P_φ is the projection on the eigendirection along the stable complement and R_φ = L_φπ_φ is the composition of the projection on the stable complement with L_φ. Then P_φR_φ = R_φP_φ = 0, so that for all n ∈ ℕ:
\[ \mathsf{L}_\phi^n = \lambda_\phi^n\mathsf{P}_\phi+\mathsf{R}_\phi^n. \]

Lemma 5.3. Under assumption (4), it holds
\[ \Big\|\Big(\frac{1}{\lambda_\phi}\mathsf{R}_\phi\Big)^n 1\Big\| \leq (6.388+4.08\,\delta_0^{-1})(1-\delta_0/13)^{n-1}\|\phi\| \]
\[ \mathsf{P}_\phi 1 = 1+O_{3.77+4.08\delta_0^{-1}}(\|\phi\|). \]

Proof. At any L = L_φ where φ satisfies (4) we have:
\begin{align*}
\Big\|D\Big[\frac{1}{\lambda}\mathsf{R}\Big]_{\mathsf{L}}\Big\| &\leq \frac{1}{|\lambda_{\mathsf{L}}|}+\frac{17/13}{|\lambda_{\mathsf{L}}|^2}\,\|\mathsf{L}\|+2K\tau_0\gamma_0\\
&\leq \frac{1}{0.9476}+\frac{17}{13}\times\frac{1.04}{0.9476^2}+\frac{48}{13}+\frac{4}{\delta_0} \leq 6.263+\frac{4}{\delta_0},
\end{align*}
so that (since R₀1 = 0)
\begin{align*}
\Big\|\frac{1}{\lambda_\phi}\mathsf{R}_\phi1-\frac{1}{\lambda_0}\mathsf{R}_01\Big\| &\leq \Big(6.263+\frac{4}{\delta_0}\Big)\|\mathsf{L}_\phi-\mathsf{L}_0\|\,\|1\|\\
\Big\|\frac{1}{\lambda_\phi}\mathsf{R}_\phi1\Big\| &\leq 1.02\Big(6.263+\frac{4}{\delta_0}\Big)\|\phi\| \leq (6.388+4.08\,\delta_0^{-1})\|\phi\|.
\end{align*}

Moreover, since R_L takes its values in G_L, where π_L acts as the identity, we have
\[ \|\mathsf{R}_\phi^n1\| \leq |\lambda_\phi|^{n-1}(1-\delta_0/13)^{n-1}\,\|\mathsf{R}_\phi1\|, \]
from which the first inequality follows. Then we have P_φ = P₀ + O_{2Kτ₀γ₀}(‖L_φ − L₀‖), which yields the claimed result using K = 1 + 12δ₀/13, τ₀ = 1, γ₀ ≤ 2δ₀^{-1} and ‖L_φ − L₀‖ ≤ 1.02‖φ‖.

This control of P_φ and R_φ^n can then be used to reduce the estimation of L_φ^n 1 to the estimation of λ_φ^n.

Corollary 5.4. Under assumptions (4) and
\[ n \geq 1+\frac{\log100}{-\log(1-\delta_0/13)} \tag{5} \]
it holds

\begin{align*}
\mathsf{L}_\phi^n1 &= \lambda_\phi^n\big(1+O_{3.834+4.121\delta_0^{-1}}(\|\phi\|)\big)\\
\lambda_\phi^n &= \exp\big(n\mu_0(\phi)+O_{3.36+2.081\delta_0^{-1}}(n\|\phi\|^2)\big)\\
\lambda_\phi^n &= \exp\Big(n\mu_0(\phi)+\frac{n}{2}\sigma^2(\phi)+O_{10.89+20.04\delta_0^{-1}+8.577\delta_0^{-2}}(n\|\phi\|^3)\Big)
\end{align*}

Proof. Assuming (4), we first appeal to Lemma 5.3 to write:
\[ \mathsf{L}_\phi^n1 = \lambda_\phi^n\mathsf{P}_\phi1+\mathsf{R}_\phi^n1 = \lambda_\phi^n\Big(1+O_{3.77+4.08\delta_0^{-1}}(\|\phi\|)+O_{6.388+4.08\delta_0^{-1}}\big((1-\delta_0/13)^{n-1}\|\phi\|\big)\Big) \tag{6} \]
The second factor of (6) is easily controlled if we ask (5), under which we have

\begin{align*}
A &:= 1+O_{3.77+4.08\delta_0^{-1}}(\|\phi\|)+O_{6.388+4.08\delta_0^{-1}}\big((1-\delta_0/13)^{n-1}\|\phi\|\big)\\
&= 1+O_{3.77+4.08\delta_0^{-1}}(\|\phi\|)+O_{0.064+0.041\delta_0^{-1}}(\|\phi\|) = 1+O_{3.834+4.121\delta_0^{-1}}(\|\phi\|).
\end{align*}
The first estimate for λ_φ^n is obtained through the first-order Taylor formula. We use the monotonicity and convexity of x ↦ (log(1 + x) − x)/x and set x = λ_φ − 1 ∈ [−b, b] with b = 0.0524 to evaluate log(λ_φ):
\[ \Big|\frac{\log(1+x)-x}{x}\Big| \leq \frac{\log(1-b)+b}{-b}\cdot\frac{|x|}{b} \leq 0.52\,|x|, \]
so that
\begin{align*}
\log(\lambda_\phi) = \log(1+\lambda_\phi-1) &= \lambda_\phi-1+O_{0.52}(|\lambda_\phi-1|^2)\\
&= \lambda_\phi-1+O_{0.52\times1.334^2}(\|\phi\|^2) = \lambda_\phi-1+O_{0.926}(\|\phi\|^2),
\end{align*}
and then, using λ_φ = 1 + μ₀(φ) + O_{2.43+2.081δ₀^{-1}}(‖φ‖²) from Lemma 5.2:
\begin{align*}
\lambda_\phi^n = \exp\big(n\log(\lambda_\phi)\big) &= \exp\big(n(\lambda_\phi-1)+O_{0.926}(n\|\phi\|^2)\big)\\
&= \exp\big(n\mu_0(\phi)+O_{3.36+2.081\delta_0^{-1}}(n\|\phi\|^2)\big).
\end{align*}
The second estimate for λ_φ^n is obtained, of course, from the second-order formula given in Lemma 5.2:

\[ \lambda_\phi = 1+\mu_0(\phi)+\frac{1}{2}\mu_0(\phi^2)+\sum_{k\geq1}\mu_0\big(\phi\,\mathsf{L}_0^k(\bar\phi)\big)+O_{7.41+17.75\delta_0^{-1}+8.49\delta_0^{-2}}(\|\phi\|^3). \]
Here it is somewhat tedious to use a convexity argument, and we instead use a slightly less precise Taylor formula: for x ∈ [−b, b] (where again b = 0.0524) we have
\[ \Big|\frac{1}{6}\frac{\mathrm{d}^3}{\mathrm{d}x^3}\log(1+x)\Big| \leq \frac{2}{6(1-0.0524)^3} \leq 0.392, \]
so that
\[ \log(1+x) = x-\frac{1}{2}x^2+O_{0.392}(x^3), \]
and therefore (using at one step |μ₀(φ)| ≤ ‖φ‖):
\begin{align*}
\log(\lambda_\phi) &= (\lambda_\phi-1)-\frac{1}{2}(\lambda_\phi-1)^2+O_{0.392}\big((\lambda_\phi-1)^3\big)\\
&= \mu_0(\phi)+\frac{1}{2}\mu_0(\phi^2)+\sum_{k\geq1}\mu_0(\phi\,\mathsf{L}_0^k\bar\phi)+O_{7.41+17.75\delta_0^{-1}+8.49\delta_0^{-2}}(\|\phi\|^3)\\
&\qquad -\frac{1}{2}\Big(\mu_0(\phi)+O_{2.43+2.081\delta_0^{-1}}(\|\phi\|^2)\Big)^2+O_{0.392\times1.334^3}(\|\phi\|^3)\\
&= \mu_0(\phi)+\frac{1}{2}\sigma^2(\phi)+O_{10.771+19.831\delta_0^{-1}+8.49\delta_0^{-2}}(\|\phi\|^3)+O_{2.953+5.06\delta_0^{-1}+2.166\delta_0^{-2}}(\|\phi\|^4).
\end{align*}
Now assumption (4) ensures ‖φ‖ ≤ 0.04, so that we can combine the two error terms into O_a(‖φ‖³) with
\[ a = 10.771+19.831\delta_0^{-1}+8.49\delta_0^{-2}+0.04\,(2.953+5.06\delta_0^{-1}+2.166\delta_0^{-2}) \leq 10.89+20.04\delta_0^{-1}+8.577\delta_0^{-2}. \]

6 Concentration inequalities

We will in this section apply Corollary 5.4 to (t/n)φ instead of φ, which we can do as soon as n is large enough with respect to t and ‖φ‖, in the sense that
\[ n \geq \frac{\|t\phi\|}{\log\Big(1+\frac{\delta_0^2}{12+13\delta_0}\Big)} \quad\text{and}\quad n \geq 1+\frac{\log100}{-\log(1-\delta_0/13)}. \tag{7} \]
Remark 6.1. The first condition can be replaced by either of the following stronger but simpler conditions:
\[ n \geq (13.3\,\delta_0^{-1}+12.3\,\delta_0^{-2})\|t\phi\| \quad\text{or}\quad n \geq \frac{26}{\delta_0^2}\,\|t\phi\|. \]
Similarly, by an elementary function analysis, the second condition can be replaced by
\[ n \geq \frac{60}{\delta_0}. \]
Under these conditions, we obtain our first control of the moment generating function of the empirical mean
\[ \hat\mu_n(\phi) := \frac{1}{n}\phi(X_1)+\dots+\frac{1}{n}\phi(X_n) \]
by plugging the first-order estimate of Corollary 5.4 into (3):

[︀ exp(푡휇ˆ (휙))]︀ ∫︁ E휇 푛 −푡휇0(휙) 푛 = 푒 L 푡 휙1(푥) d휇(푥) exp(푡휇0(휙)) 푛 2 (︀ 푡 )︀ 푡 2 = 1 + 푂3.834+4.121훿−1 ( ‖휙‖) exp(푂3.36+2.081훿−1 ( ‖휙‖ )) 0 푛 0 푛 By the classical Chernov bound, it follows that for all 푎, 푡 > 0: [︀ ]︀ P휇 |휇ˆ푛(휙) − 휇0(휙)| ≥ 푎 푡 푡2 ≤ (︀2 + (7.668 + 8.242훿−1) ‖휙‖)︀ exp (︀ − 푎푡 + (3.36 + 2.081훿−1) ‖휙‖2))︀ (8) 0 푛 0 푛 6.1 Gaussian regime Our first concentration inequality is obtained by choosing 푡 to optimize the argument of the exponential in (8), i.e. taking 푛푎 푡 = . −1 2 2(3.36 + 2.081훿0 )‖휙‖ This choice can be made as soon as 푎 is small enough: indeed the first condition on 푛 then reads 푛푎 푛 ≥ (︁ 훿2 )︁ 2(3.36 + 2.081훿−1) log 1 + 0 ‖휙‖ 0 12+13훿0

29 i.e. 2 −1 (︁ 훿0 )︁ 푎 ≤ (6.72 + 4.162훿0 ) log 1 + ‖휙‖. 12 + 13훿0 Let us find a simpler lower bound for the right-hand side: 2 2 −1 (︁ 훿0 )︁ −1 훿0 (6.72 + 4.162훿0 ) log 1 + ≥ (6.72 + 4.162훿0 ) · 0.98 12 + 13훿0 12 + 13훿0 6.58훿0 + 4 ≥ 훿0 13훿0 + 12 훿 ≥ 0 3 so that a sufficient condition to make the above choice for 푡 is 훿 ‖휙‖ 푎 ≤ 0 . (9) 3 Then the argument in the exponential becomes 푡2 푛푎2 −푎푡 + (3.36 + 2.081훿−1) ‖휙‖2) ≤ − 0 −1 2 푛 (13.44 + 8.324훿0 )‖휙‖ and the constant in front: −1 −1 푡 (7.668 + 8.242훿0 )푎 2 + (7.668 + 8.242훿0 ) ‖휙‖ ≤ 2 + −1 푛 (6.72 + 4.162훿0 )‖휙‖ 7.668훿2 + 8.242훿 ≤ 2 + 0 0 20.16훿0 + 12.486 7.668 + 8.242 ≤ 2 + ≤ 2.488 ≤ 2.5 20.16 + 12.486 Remark 6.2. We could also have bounded the front constant in a different way to show it can be taken close to 2 for small 푎: −1 −1 푡 (7.668 + 8.242훿0 )푎 2 + (7.668 + 8.424훿0 ) ‖휙‖ ≤ 2 + −1 푛 (6.72 + 4.162훿0 )‖휙‖ 8.242 푎 ≤ 2 + 4.162 ‖휙‖ 푎 ≤ 2 + 2 ‖휙‖ We obtain a version of the first part of TheoremA. Theorem 6.3. For all 푛, 푎 such that log 100 훿 ‖휙‖ 푛 ≥ 1 + and 푎 ≤ 0 − log(1 − 훿0/13) 3 it holds 2 [︁ ]︁ (︁ 훿0 푎 )︁ P휇 |휇ˆ푛(휙) − 휇0(휙)| ≥ 푎 ≤ 2.488 exp − 푛 2 13.44훿0 + 8.324 ‖휙‖

A simpler, less precise estimate is
\[
\mathbb{P}_\mu\Big[|\hat\mu_n(\phi) - \mu_0(\phi)| \ge a\Big]
\le 2.5\,\exp\Big(-n \cdot 0.046\,\delta_0\,\frac{a^2}{\|\phi\|^2}\Big)
\]
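As a numerical sanity check (not part of the proof; the constants are those appearing in the derivation above), one can verify the elementary bound $\log(1+x) \ge 0.98\,x$ on the relevant range $x = \delta_0^2/(12+13\delta_0) \le 1/25$, together with the front-constant estimate $2 + (7.668+8.242)/(20.16+12.486) \le 2.488$:

```python
import math

# Sanity check (not part of the proof) of two elementary estimates used above.
# 1) log(1+x) >= 0.98 x for x = delta_0^2/(12 + 13 delta_0), which is <= 1/25:
log_ok = all(math.log1p(x) >= 0.98 * x for x in [k / 2500 for k in range(1, 101)])

# 2) the front constant, evaluated at the worst case delta_0 = 1:
front = 2 + (7.668 + 8.242) / (20.16 + 12.486)

print(log_ok)          # True
print(front <= 2.488)  # True
```

Both checks pass on a grid of $\delta_0$ values; they merely confirm the arithmetic of the explicit constants.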

6.2 Exponential regime

For larger $a$, we obtain a result with exponential decay by taking $t$ as large as allowed by the first smallness condition (7), i.e.
\[
t \simeq \frac{n}{\|\phi\|}\,\log\Big(1 + \frac{\delta_0^2}{12+13\delta_0}\Big).
\]
To simplify, we precisely take the slightly smaller
\[
t = \frac{n}{\|\phi\|} \times \frac{0.98\,\delta_0^2}{12+13\delta_0}.
\]
Then the argument in the exponential becomes

\begin{align*}
-at + (3.36 + 2.081\,\delta_0^{-1})\,\frac{t^2}{n}\|\phi\|^2
&= n\,\frac{0.98\,\delta_0^2}{12+13\delta_0}\Big(-\frac{a}{\|\phi\|} + \frac{0.98\,(3.36\,\delta_0^2 + 2.081\,\delta_0)}{12+13\delta_0}\Big) \\
&\le -n\,\frac{0.98\,\delta_0^2}{12+13\delta_0}\Big(\frac{a}{\|\phi\|} - 0.254\,\delta_0\Big)
\end{align*}
and the constant in front:
\begin{align*}
2 + (7.668 + 8.242\,\delta_0^{-1})\,\frac{t}{n}\|\phi\|
&= 2 + (7.668 + 8.242\,\delta_0^{-1})\,\frac{0.98\,\delta_0^2}{12+13\delta_0} \\
&= 2 + \frac{7.515\,\delta_0^2 + 8.078\,\delta_0}{12+13\delta_0}
\le 2 + \frac{15.593}{25} \le 2.624
\end{align*}
We obtain a version of the second part of Theorem A.

Theorem 6.4. For all $n, a$ such that
\[
n \ge 1 + \frac{\log 100}{-\log(1 - \delta_0/13)}
\quad\text{and}\quad
a \ge \frac{\delta_0\,\|\phi\|}{3}
\]
it holds
\[
\mathbb{P}_\mu\Big[|\hat\mu_n(\phi) - \mu_0(\phi)| \ge a\Big]
\le 2.624\,\exp\Big(-n\,\frac{0.98\,\delta_0^2}{12+13\delta_0}\Big(\frac{a}{\|\phi\|} - 0.254\,\delta_0\Big)\Big).
\]
A simpler, less precise estimate is:
\[
\mathbb{P}_\mu\Big[|\hat\mu_n(\phi) - \mu_0(\phi)| \ge a\Big]
\le 2.7\,\exp\Big(-n \cdot 0.009\,\delta_0^2\,\frac{a}{\|\phi\|}\Big).
\]
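As before, a quick numerical scan (a sanity check only, not part of the proof) confirms the two constants introduced in this regime: the term $0.98\,(3.36\,\delta_0^2+2.081\,\delta_0)/(12+13\delta_0)$ is indeed dominated by $0.254\,\delta_0$ on $(0,1]$, and the front constant stays below $2.624$:

```python
# Sanity check (not part of the proof) of the exponential-regime constants above:
# 0.98 (3.36 d^2 + 2.081 d) / (12 + 13 d) <= 0.254 d for d = delta_0 in (0, 1],
# and the front constant at the worst case delta_0 = 1 is at most 2.624.
grid = [k / 1000 for k in range(1, 1001)]
exp_ok = all(
    0.98 * (3.36 * d * d + 2.081 * d) / (12 + 13 * d) <= 0.254 * d for d in grid
)
front = 2 + (7.515 + 8.078) / 25  # value at delta_0 = 1
print(exp_ok)          # True
print(front <= 2.624)  # True
```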

6.3 Second-order concentration

In case one has a good upper bound on the variance

\[
\sigma^2(\phi) = \mu_0(\phi^2) - (\mu_0\phi)^2 + 2\sum_{k\ge1}\mu_0\big(\phi\,\mathrm{L}_0^k\bar\phi\big)
\]
then the previous concentration results can be improved by using the second-order formula in Corollary 5.4, which yields

\[
\frac{\mathbb{E}_\mu\big[\exp(t\hat\mu_n(\phi))\big]}{\exp(t\mu_0(\phi))}
= \exp\Big(\frac{t^2}{2n}\,\sigma^2(\phi) + O_{10.89+20.04\delta_0^{-1}+8.577\delta_0^{-2}}\big(\tfrac{t^3}{n^2}\|\phi\|^3\big)\Big)
\times \Big(1 + O_{3.834+4.121\delta_0^{-1}}\big(\tfrac{t}{n}\|\phi\|\big)\Big)
\]
so that, if we know $\sigma^2(\phi) \le U$:

\[
\mathbb{P}_\mu\big[|\hat\mu_n(\phi)-\mu_0(\phi)| \ge a\big]
\le \Big(2 + \frac{(7.668 + 8.242\,\delta_0^{-1})\,t}{n}\,\|\phi\|\Big)
\exp\Big(-at + \frac{t^2}{2n}\,U + C\,\frac{t^3}{n^2}\,\|\phi\|^3\Big)
\]
where $C$ can be any number above $10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}$. To get a compact expression, we observe that $0.89 + 0.04\,\delta_0^{-1} \le 0.93\,\delta_0^{-2}$ so that
\[
10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}
\le 10 + 20\,\delta_0^{-1} + 9.507\,\delta_0^{-2}
\le 10\,(1 + \delta_0^{-1})^2 =: C.
\]
The choice of $t$ can then be adapted to the circumstances; we will only explore the choice $t = an/U$, which is nearly optimal when $a$ is small. This choice can be made as soon as
\[
a \le \frac{U}{\|\phi\|}\,\log\Big(1 + \frac{\delta_0^2}{12+13\delta_0}\Big)
\]
and entails the following upper bound for the front constant:
\[
2 + (7.668 + 8.242\,\delta_0^{-1})\,\frac{\delta_0^2}{12+13\delta_0}
\le 2 + \frac{7.668 + 8.242}{12 + 13} \le 2.637
\]
Meanwhile, the exponent becomes
\[
-at + \frac{t^2}{2n}\,U + C\,\frac{t^3}{n^2}\,\|\phi\|^3
= -\frac{a^2 n}{2U} + \frac{C\,\|\phi\|^3 a^3 n}{U^3}
\]
which, given that $\hat\mu_n(\phi)$ satisfies the Central Limit Theorem, is nearly optimal if $a \ll \frac{U^2}{2C\|\phi\|^3}$ and $U$ is close to $\sigma^2(\phi)$. We obtain Theorem B, in the following version.

Theorem 6.5. For all $n \ge 60/\delta_0$, if $\sigma^2(\phi) \le U$ and
\[
a \le \frac{U}{\|\phi\|}\,\log\Big(1 + \frac{\delta_0^2}{12+13\delta_0}\Big)
\]
then it holds
\[
\mathbb{P}_\mu\big[|\hat\mu_n(\phi) - \mu_0(\phi)| \ge a\big]
\le 2.637\,\exp\Big(-n\Big(\frac{a^2}{2U} - 10\,(1+\delta_0^{-1})^2\,\frac{\|\phi\|^3 a^3}{U^3}\Big)\Big)
\]
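The compact constant $C = 10(1+\delta_0^{-1})^2$ above dominates $10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}$ for every $\delta_0 \in (0,1]$; a numerical scan over $x = \delta_0^{-1} \ge 1$ (a sanity check only, not part of the proof) confirms this:

```python
# Sanity check (not part of the proof): with x = 1/delta_0 >= 1,
# 10.89 + 20.04 x + 8.577 x^2 <= 10 (1 + x)^2 = 10 + 20 x + 10 x^2,
# which justifies the compact constant C = 10 (1 + delta_0^{-1})^2 used above.
ok = all(
    10.89 + 20.04 * x + 8.577 * x * x <= 10 * (1 + x) ** 2
    for x in [1 + k / 100 for k in range(0, 10001)]  # x ranges over [1, 101]
)
print(ok)  # True
```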

7 Berry-Esseen bounds

In this section, we use the second-order Taylor formula for the leading eigenvalue to prove effective Berry-Esseen bounds. The method we use is the one proposed by Feller [Fel66], which does not yield the best constant in the IID case, but is quite easily adapted to the Markov or dynamical case, as observed in [CP90]. The starting point is a "smoothing" argument that allows one to translate the proximity of characteristic functions into a proximity of distribution functions.

Proposition 7.1 ([Fel66]). Let $F, G$ be the distribution functions and $\varphi, \gamma$ the characteristic functions of real random variables with vanishing expectation. Assume $G$ is differentiable with $\|G'\|_\infty \le m$; then for all $T > 0$:

\[
\|F - G\|_\infty \le \frac{1}{\pi}\int_{-T}^{T}\Big|\frac{\varphi(t) - \gamma(t)}{t}\Big|\,\mathrm{d}t + \frac{24\,m}{\pi T}.
\]

We set $G(T) = (2\pi)^{-\frac12}\int_{-\infty}^{T} e^{-t^2/2}\,\mathrm{d}t$, the reduced normal distribution function, so that $\|G'\|_\infty = (2\pi)^{-\frac12}$ and $\gamma(t) = e^{-t^2/2}$, and apply the above estimate to the distribution function $F_n$ of the random variable $\frac{1}{\sqrt n}\big(\tilde\phi(X_1) + \dots + \tilde\phi(X_n)\big)$, where here $\tilde\phi$ is the fully normalized version of $\phi$:

\[
\tilde\phi = \frac{\phi - \mu_0(\phi)}{\sigma(\phi)}
\quad\text{where}\quad
\sigma^2(\phi) = \mu_0(\phi^2) - (\mu_0\phi)^2 + 2\sum_{k\ge1}\mu_0\big(\phi\,\mathrm{L}_0^k(\bar\phi)\big),
\]

assuming $\sigma^2(\phi) > 0$ and with $\bar\phi := \phi - \mu_0(\phi)$. We save for later the following observation:

\begin{align*}
\sigma^2(\phi) = \sigma^2(\bar\phi)
&\le \|\bar\phi\|_\infty^2 + 2\sum_{k\ge1}\|\bar\phi\|_\infty (1-\delta_0)^k \|\bar\phi\| \\
&\le \|\bar\phi\|^2\Big(1 + \frac{2}{\delta_0}\Big)
\end{align*}
so that

\[
\|\tilde\phi\| \ge \Big(1 + \frac{2}{\delta_0}\Big)^{-\frac12} \ge \sqrt{\delta_0/3}
\]
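The second inequality amounts to $\delta_0 + 2 \le 3$, which holds since $\delta_0 \le 1$; a quick numerical scan (a sanity check only, not part of the proof) confirms it, with equality at $\delta_0 = 1$:

```python
# Sanity check (not part of the proof): (1 + 2/d)^(-1/2) >= sqrt(d/3)
# for all d = delta_0 in (0, 1]; equality holds exactly at d = 1.
ok = all(
    (1 + 2 / d) ** -0.5 >= (d / 3) ** 0.5 - 1e-12  # tolerance for the equality case
    for d in [k / 1000 for k in range(1, 1001)]
)
print(ok)  # True
```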

Applying formula (3) to $\frac{it}{\sqrt n}\tilde\phi$, we obtain an expression for the characteristic function
\[
\varphi_n(t) = \int \mathrm{L}^n_{\frac{it}{\sqrt n}\tilde\phi}\,1(x)\,\mathrm{d}\mu(x)
= \lambda^n_{\frac{it}{\sqrt n}\tilde\phi}\,
\underbrace{\Big(\int \mathrm{P}_{\frac{it}{\sqrt n}\tilde\phi}\,1\,\mathrm{d}\mu
+ \int \big[\mathrm{R}_{\frac{it}{\sqrt n}\tilde\phi}/\lambda_{\frac{it}{\sqrt n}\tilde\phi}\big]^n\,1\,\mathrm{d}\mu\Big)}_{=:A}
\]

where $\mu$ is the law of $X_0$. From now on, we assume
\[
\sqrt n \ge \frac{\|t\tilde\phi\|}{\log\Big(1 + \frac{\delta_0^2}{13+12\delta_0}\Big)}
\quad\text{and}\quad
\sqrt n \ge 1 + \frac{\log 100}{-\log(1 - \delta_0/13)}
\]
to apply the estimates from Section 5. We will use later that this condition, considering the extremal case $\delta_0 = 1$, implies $n \ge 3311$. We then get (Corollary 5.4):
\[
A = \int \mathrm{P}_{\frac{it}{\sqrt n}\tilde\phi}\,1\,\mathrm{d}\mu
+ \lambda^{-n}_{\frac{it}{\sqrt n}\tilde\phi}\int \mathrm{R}^n_{\frac{it}{\sqrt n}\tilde\phi}\,1\,\mathrm{d}\mu
= 1 + O_{3.834+4.121\delta_0^{-1}}\big(\|\tfrac{t}{\sqrt n}\tilde\phi\|\big).
\]
We also have from Corollary 5.4

\[
\lambda^n_{\frac{it}{\sqrt n}\tilde\phi}
= \exp\Big(-\frac{t^2}{2} + O_{10.89+20.04\delta_0^{-1}+8.577\delta_0^{-2}}\big(\tfrac{1}{\sqrt n}\|t\tilde\phi\|^3\big)\Big).
\]

In order to bound $|\varphi_n(t) - \gamma(t)|$, following Feller [Fel66] we use that for all $a, b, c$ such that $|a|, |b| \le c$ and all $n \in \mathbb{N}$ it holds
\[
|a^n - b^n| \le n\,|a - b|\,c^{n-1}.
\]

We take $a = \varphi_n(t)^{\frac1n}$, $b = \gamma(t)^{\frac1n}$ and $c$ an upper bound which we will now choose. Feller takes $c = e^{-\frac{t^2}{4n}}$, but we need two adaptations and take $c = 1.32^{\frac1n}\,e^{-\alpha\frac{t^2}{n}}$, where $\alpha \in (0, 0.5)$ will be optimized later on. We have $\gamma(t)^{\frac1n} = e^{-\frac{t^2}{2n}} \le c$ and

\[
\varphi_n(t)^{\frac1n}
\le e^{-\frac{t^2}{2n}}\,\exp\Big((10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2})\,\frac{1}{n^{3/2}}\,\|t\tilde\phi\|^3\Big)\,A^{\frac1n}
\]
where
\begin{align*}
A &\le 1 + (3.834 + 4.121\,\delta_0^{-1})\,\big\|\tfrac{t}{\sqrt n}\tilde\phi\big\| \\
&\le 1 + (3.834 + 4.121\,\delta_0^{-1})\,\frac{\delta_0^2}{13+12\delta_0} \\
&\le 1.32
\end{align*}

To ensure $\varphi_n(t)^{\frac1n} \le c$, it is therefore sufficient that
\[
(10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2})\,\frac{1}{\sqrt n}\,\|t\tilde\phi\|^3 \le (0.5 - \alpha)\,t^2
\]
i.e. it is sufficient to ask
\[
\sqrt n \ge \frac{10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}}{0.5 - \alpha}\,|t|\,\|\tilde\phi\|^3 \tag{10}
\]

Under this assumption, we have (using $n \ge 3311$ to bound $(n-1)/n$ from below by 0.9996):

\[
|\varphi_n(t) - \gamma(t)| \le 1.32\,n\,e^{-0.9996\,\alpha t^2}\,\big|\varphi_n(t)^{\frac1n} - \gamma(t)^{\frac1n}\big|. \tag{11}
\]

Now we will bound $|\varphi_n(t)^{\frac1n} - \gamma(t)^{\frac1n}|$, starting with a finer evaluation of $A$:

\[
A^{\frac1n} = \Big(1 + O_{3.834+4.121\delta_0^{-1}}\big(\|\tfrac{t}{\sqrt n}\tilde\phi\|\big)\Big)^{\frac1n}
\le \exp\Big(\frac{1}{n^{3/2}}\,(3.834 + 4.121\,\delta_0^{-1})\,\|t\tilde\phi\|\Big)
\]
By our assumptions the argument of the exponential is not greater than

\[
\frac{1}{n}\,(3.834 + 4.121\,\delta_0^{-1})\,\log\Big(1 + \frac{\delta_0^2}{13+12\delta_0}\Big)
\le \frac{1}{3311}\,\frac{3.834\,\delta_0^2 + 4.121\,\delta_0}{13+12\delta_0}
\le 0.0001
\]
and using $e^{0.0001} \le 1.0002$, for all $\varepsilon \in [0, 0.0001]$ we have $\exp(\varepsilon) \le 1 + 1.0002\,\varepsilon$ and therefore:
\[
A^{\frac1n} \le 1 + \frac{3.835 + 4.122\,\delta_0^{-1}}{n^{3/2}}\,\|t\tilde\phi\|
\]
Now we have

\[
\big|\varphi_n(t)^{\frac1n} - \gamma(t)^{\frac1n}\big|
\le \Big|\lambda_{\frac{it}{\sqrt n}\tilde\phi}\Big(1 + \frac{3.835 + 4.122\,\delta_0^{-1}}{n^{3/2}}\,\|t\tilde\phi\|\Big) - 1 + \frac{t^2}{2n}\Big|
+ \Big|e^{-\frac{t^2}{2n}} - 1 + \frac{t^2}{2n}\Big|
\]

Since for all $x \in [0, +\infty)$ we have $0 \le e^{-x} - 1 + x \le \frac12 x^2$, the second summand is bounded above by $\frac{t^4}{8n^2}$. In the first summand we use (Lemma 5.2, definition of $\sigma^2$ and normalization of $\tilde\phi$)

\[
\lambda_{\frac{it}{\sqrt n}\tilde\phi} = 1 - \frac{t^2}{2n} + O_{7.41+17.75\delta_0^{-1}+8.49\delta_0^{-2}}\Big(\big\|\tfrac{t}{\sqrt n}\tilde\phi\big\|^3\Big).
\]

The lower order terms simplify and we obtain
\begin{align*}
\big|\varphi_n(t)^{\frac1n} - \gamma(t)^{\frac1n}\big|
&\le \Big|O_{7.41+17.75\delta_0^{-1}+8.49\delta_0^{-2}}\big(\|\tfrac{t}{\sqrt n}\tilde\phi\|^3\big)
+ \lambda_{\frac{it}{\sqrt n}\tilde\phi}\,\frac{3.835 + 4.122\,\delta_0^{-1}}{n^{3/2}}\,\|t\tilde\phi\|\Big| + \frac{t^4}{8n^2} \\
&\le \frac{1}{n^{3/2}}\Big((7.41 + 17.75\,\delta_0^{-1} + 8.49\,\delta_0^{-2})\,\|t\tilde\phi\|^3
+ 1.0524\,(3.835 + 4.122\,\delta_0^{-1})\,\|t\tilde\phi\|\Big) + \frac{t^4}{8n^2} \\
&\le \frac{(7.41 + 17.75\,\delta_0^{-1} + 8.49\,\delta_0^{-2})\,\|t\tilde\phi\|^3 + (4.036 + 4.338\,\delta_0^{-1})\,\|t\tilde\phi\|}{n^{3/2}} + \frac{t^4}{8n^2}
\end{align*}
For all $T > 0$ such that the above conditions on $n, t$ hold for all $t \in [-T, T]$, we have by Proposition 7.1:

\[
\|F_n - G\|_\infty \le \frac{1}{\pi}\int_{-T}^{T}\Big|\frac{\varphi_n(t) - \gamma(t)}{t}\Big|\,\mathrm{d}t + \frac{24\,m}{\pi T}
\]

\begin{align*}
&\le \frac{2.64}{\pi}\int_0^T n\,e^{-0.9996\,\alpha t^2}\,\big|\varphi_n(t)^{\frac1n} - \gamma(t)^{\frac1n}\big|\,\mathrm{d}t + \frac{3.048}{T} \\
&\le \frac{2.64}{\pi\sqrt n}\int_0^\infty e^{-0.9996\,\alpha t^2}\,\big(d\,\|t\tilde\phi\|^3 + f\,\|t\tilde\phi\| + g\,t^4\big)\,\mathrm{d}t + \frac{3.048}{T}
\end{align*}

where $d = 7.41 + 17.75\,\delta_0^{-1} + 8.49\,\delta_0^{-2}$, $f = 4.036 + 4.338\,\delta_0^{-1}$ and, using $n \ge 3311$, $g = 0.0022$. We want to take $T$ as large as possible to lower the last term, but we need to ensure two conditions:

\[
T \le \frac{\sqrt n}{\|\tilde\phi\|}\,\log\Big(1 + \frac{\delta_0^2}{13+12\delta_0}\Big)
\quad\text{and}\quad
T \le \frac{\sqrt n}{\|\tilde\phi\|^3}\,\frac{0.5 - \alpha}{10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}}
\]
We could use here the lower bound on $\|\tilde\phi\|$ to replace the left condition by a condition of the same form as the right one, but this would be too strong when $\|\tilde\phi\|$ is far from the bound. We will make a choice which will be better when $\|\tilde\phi\|$ is of the order of 1 (recall $\tilde\phi$ is normalized, and therefore insensitive to scaling of $\phi$), by replacing the above conditions by the more stringent

\[
T \le \frac{\sqrt n}{\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}\,
\min\Big\{\log\Big(1 + \frac{\delta_0^2}{13+12\delta_0}\Big),\;
\frac{0.5 - \alpha}{10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}}\Big\}
\]

In the min, the first term is larger than $0.98\,\delta_0^2/(12+13\delta_0)$, which is easily seen to be larger than the second term for all $\delta_0$. We thus take
\[
T = \frac{\sqrt n}{\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}\,
\frac{0.5 - \alpha}{10.89 + 20.04\,\delta_0^{-1} + 8.577\,\delta_0^{-2}}
\]
and we obtain
\begin{align*}
\|F_n - G\|_\infty
&\le \frac{2.64}{\pi\sqrt n}\int_0^{+\infty} e^{-0.9996\,\alpha t^2}\,\big(d\,\|\tilde\phi\|^3 t^3 + f\,\|\tilde\phi\|\,t + g\,t^4\big)\,\mathrm{d}t \\
&\quad + \frac{(33.193 + 61.082\,\delta_0^{-1} + 26.082\,\delta_0^{-2})\,\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}{(0.5 - \alpha)\,\sqrt n}
\end{align*}

We have for $d = 1, 3, 4$:

\[
\int_0^{+\infty} e^{-0.9996\,\alpha t^2}\,t^d\,\mathrm{d}t
= (0.9996\,\alpha)^{-\frac{d+1}{2}}\int_0^{+\infty} e^{-t^2}\,t^d\,\mathrm{d}t.
\]

Since $\int_0^{+\infty} e^{-t^2}\,t^d\,\mathrm{d}t = \frac12\,\Gamma\big(\frac{d+1}{2}\big)$ we thus have

\begin{align*}
\|F_n - G\|_\infty
&\le \frac{1.32}{\pi\sqrt n}\Big(d\,(0.9996\,\alpha)^{-2}\,\|\tilde\phi\|^3 + f\,(0.9996\,\alpha)^{-1}\,\|\tilde\phi\| + g\,(0.9996\,\alpha)^{-\frac52}\,\Gamma\big(\tfrac52\big)\Big) \\
&\quad + \frac{(33.193 + 61.082\,\delta_0^{-1} + 26.082\,\delta_0^{-2})\,\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}{(0.5 - \alpha)\,\sqrt n}
\end{align*}
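The Gaussian-moment identity $\int_0^{+\infty} e^{-t^2}\,t^d\,\mathrm{d}t = \frac12\Gamma\big(\frac{d+1}{2}\big)$ used here can be verified numerically (a sanity check only, not part of the proof) for the three exponents that occur, $d = 1, 3, 4$:

```python
import math

def gaussian_moment(d, T=12.0, steps=200_000):
    # Trapezoid rule for the integral of exp(-t^2) t^d over [0, T];
    # the tail beyond T = 12 is far below the tolerance used here.
    h = T / steps
    s = 0.5 * (0.0 + math.exp(-T * T) * T ** d)
    for k in range(1, steps):
        t = k * h
        s += math.exp(-t * t) * t ** d
    return s * h

for d in (1, 3, 4):
    exact = 0.5 * math.gamma((d + 1) / 2)
    print(abs(gaussian_moment(d) - exact) < 1e-6)  # True True True
```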

We will now choose $\alpha$, by comparing the two most troublesome coefficients: $\frac{1.32\,d}{\pi(0.9996\,\alpha)^2}$, which is close to $\frac{0.42\,d}{\alpha^2}$ (and makes us want to take $\alpha$ large), and $\frac{33.193 + 61.082\,\delta_0^{-1} + 26.082\,\delta_0^{-2}}{0.5 - \alpha}$, which is somewhat close to $\frac{2.9\,d}{0.5 - \alpha}$ when $\delta_0$ is small (and makes us want to take $\alpha$ small). This leads us to take $\alpha = 0.2$. We then get
\begin{align*}
\|F_n - G\|_\infty
&\le \frac{1}{\sqrt n}\Big((77.9 + 186.6\,\delta_0^{-1} + 89.26\,\delta_0^{-2})\,\|\tilde\phi\|^3 + (8.49 + 9.12\,\delta_0^{-1})\,\|\tilde\phi\| + 0.069 \\
&\qquad\quad + (110.65 + 203.61\,\delta_0^{-1} + 86.94\,\delta_0^{-2})\,\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}\Big) \\
&\le \frac{1}{\sqrt n}\Big((197.04 + 399.33\,\delta_0^{-1} + 176.2\,\delta_0^{-2})\,\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\} + 0.069\Big)
\end{align*}

Using $\|\tilde\phi\| \ge \sqrt{\delta_0/3}$ and since $\delta_0 \le 1$, we obtain $0.069 \le 0.36\,\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}\,\delta_0^{-2}$ so that
\[
\|F_n - G\|_\infty \le \frac{\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}{\sqrt n}\,\big(197.04 + 399.33\,\delta_0^{-1} + 176.56\,\delta_0^{-2}\big)
\]
and finally
\[
\|F_n - G\|_\infty
\le 177\,(\delta_0^{-2} + 2.26\,\delta_0^{-1} + 1.12)\,\frac{\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}{\sqrt n}
\le 177\,(\delta_0^{-1} + 1.13)^2\,\frac{\max\{\|\tilde\phi\|, \|\tilde\phi\|^3\}}{\sqrt n}
\]
which is Theorem C.
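The final simplification replaces the polynomial $197.04 + 399.33\,\delta_0^{-1} + 176.56\,\delta_0^{-2}$ by the perfect square $177\,(\delta_0^{-1} + 1.13)^2$; this domination is termwise, and a numerical scan over $x = \delta_0^{-1} \ge 1$ (a sanity check only, not part of the proof) confirms it:

```python
# Sanity check (not part of the proof): with x = 1/delta_0 >= 1, the polynomial
# 197.04 + 399.33 x + 176.56 x^2 is dominated by
# 177 (x + 1.13)^2 = 177 x^2 + 400.02 x + 226.0113, as used in Theorem C.
ok = all(
    197.04 + 399.33 * x + 176.56 * x * x <= 177 * (x + 1.13) ** 2
    for x in [1 + k / 100 for k in range(0, 10001)]  # x ranges over [1, 101]
)
print(ok)  # True
```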

References

[Bal00] Viviane Baladi, Positive transfer operators and decay of correlations, Advanced Series in Nonlinear Dynamics, vol. 16, World Scientific Publishing Co., Inc., River Edge, NJ, 2000. MR 1793194 (2001k:37035)

[Bol82] Erwin Bolthausen, The Berry-Esseen theorem for strongly mixing Harris recurrent Markov chains, Probability Theory and Related Fields 60 (1982), no. 3, 283–289.

[BT08] Henk Bruin and Mike Todd, Equilibrium states for interval maps: potentials with $\sup\phi - \inf\phi < h_{\mathrm{top}}(f)$, Comm. Math. Phys. 283 (2008), no. 3, 579–611. MR 2434739

[CP90] Zaqueu Coelho and William Parry, Central limit asymptotics for shifts of finite type, Israel J. Math. 69 (1990), no. 2, 235–249. MR 1045376

[CS09] Van Cyr and Omri Sarig, Spectral gap and transience for Ruelle operators on countable Markov shifts, Comm. Math. Phys. 292 (2009), no. 3, 637–666. MR 2551790

[CV13] A. Castro and P. Varandas, Equilibrium states for non-uniformly expanding maps: decay of correlations and strong stability, Ann. Inst. H. Poincaré Anal. Non Linéaire 30 (2013), no. 2, 225–249. MR 3035975

[Dub11] Loïc Dubois, An explicit Berry-Esséen bound for uniformly expanding maps on the interval, Israel Journal of Mathematics 186 (2011), no. 1, 221–250.

[Erd39] Paul Erdős, On a family of symmetric Bernoulli convolutions, American Journal of Mathematics 61 (1939), no. 4, 974–976.

[Fel66] William Feller, An introduction to probability theory and its applications. Vol. II, John Wiley & Sons, Inc., New York-London-Sydney, 1966. MR 0210154

[GD12] David M. Gómez and Pablo Dartnell, Simple Monte Carlo integration with respect to Bernoulli convolutions, Applications of Mathematics 57 (2012), no. 6, 617–626.

[GKLMF15] Paolo Giulietti, Benoît R. Kloeckner, Artur O. Lopes, and Diego Marcon Farias, The calculus of thermodynamical formalism, arXiv:1508.01297, to appear in J. Eur. Math. Soc., 2015.

[GO02] Peter W. Glynn and Dirk Ormoneit, Hoeffding's inequality for uniformly ergodic Markov chains, Statistics & Probability Letters 56 (2002), no. 2, 143–146.

[HH01] Hubert Hennion and Loïc Hervé, Limit theorems for Markov chains and stochastic properties of dynamical systems by quasi-compactness, Lecture Notes in Mathematics, vol. 1766, Springer-Verlag, Berlin, 2001.

[JO10] Aldéric Joulin and Yann Ollivier, Curvature, concentration and error estimates for Markov chain Monte Carlo, Ann. Probab. 38 (2010), no. 6, 2418–2442. MR 2683634

[KL99] Gerhard Keller and Carlangelo Liverani, Stability of the spectrum for transfer operators, Annali della Scuola Normale Superiore di Pisa - Classe di Scienze 28 (1999), no. 1, 141–152.

[KLMM05] Ioannis Kontoyiannis, Luis A. Lastras-Montaño, and Sean P. Meyn, Relative entropy and exponential deviation bounds for general Markov chains, Proceedings of the International Symposium on Information Theory (ISIT 2005), IEEE, 2005, pp. 1563–1567.

[Klo17a] Benoît R. Kloeckner, Effective high-temperature estimates for intermittent maps, to appear in Ergodic Theory Dynam. Systems, arXiv:1704.00586, 2017.

[Klo17b] , Effective perturbation theory for linear operators, arXiv:1703.09425, 2017.

[Klo17c] , An optimal transportation approach to the decay of correlations for non-uniformly expanding maps, arXiv:1711.08052, 2017.

[KM12] Ioannis Kontoyiannis and Sean P. Meyn, Geometric ergodicity and the spectral gap of non-reversible Markov chains, Probability Theory and Related Fields (2012), 1–13.

[Lez98] Pascal Lezaud, Chernoff-type bound for finite Markov chains, Ann. Appl. Probab. 8 (1998), no. 3, 849–867. MR 1627795

[Lez01] , Chernoff and Berry-Esséen inequalities for Markov processes, ESAIM: Probability and Statistics 5 (2001), 183–201.

[Nag57] S. V. Nagaev, Some limit theorems for stationary Markov chains, Teor. Veroyatnost. i Primenen. 2 (1957), 389–416. MR 0094846

[Nag61] , More exact limit theorems for homogeneous Markov chains, Teor. Verojatnost. i Primenen. 6 (1961), 67–86. MR 0131291

[Oll09] Yann Ollivier, Ricci curvature of Markov chains on metric spaces, J. Funct. Anal. 256 (2009), no. 3, 810–864. MR 2484937

[Pau15] Daniel Paulin, Concentration inequalities for Markov chains by Marton couplings and spectral methods, Electronic Journal of Probability 20 (2015).

[Pau16] , Mixing and concentration by Ricci curvature, Journal of Functional Analysis 270 (2016), no. 5, 1623–1662.

[Pet17] Fedor Petrov, Answer to "Diameter of a weighted Hamming cube", MathOverflow, 2017, https://mathoverflow.net/a/286346/4961.

[PSS00] Yuval Peres, Wilhelm Schlag, and Boris Solomyak, Sixty years of Bernoulli convolutions, Progress in Probability (2000), 39–68.

[RR+04] Gareth O. Roberts, Jeffrey S. Rosenthal, et al., General state space Markov chains and MCMC algorithms, Probability Surveys 1 (2004), 20–71.

[Rue04] David Ruelle, Thermodynamic formalism, second ed., Cambridge Mathematical Library, Cambridge University Press, Cambridge, 2004, The mathematical structures of equilibrium statistical mechanics.
MR 2129258 (2006a:82008)

[Sol95] Boris Solomyak, On the random series $\sum \pm\lambda^n$ (an Erdős problem), Annals of Mathematics (1995), 611–625.

[Tyu11] I. S. Tyurin, Improvement of the remainder in the Lyapunov theorem, Teor. Veroyatn. Primen. 56 (2011), no. 4, 808–811. MR 3137072
