
Estimating accuracy of the MCMC:

a central limit theorem for batch means

Saptarshi Chakraborty∗, Suman K. Bhattacharya† and Kshitij Khare†

∗Department of Epidemiology & Biostatistics Memorial Sloan Kettering Cancer Center 485 Lexington Ave New York, NY 10017, USA e-mail: [email protected]

†Department of Statistics, University of Florida, 101 Griffin-Floyd Hall, Gainesville, Florida 32601, USA. e-mail: [email protected]; e-mail: [email protected]

Abstract: The batch means estimator of the MCMC variance is a simple and effective measure of accuracy for MCMC based ergodic averages. Under various regularity conditions, the estimator has been shown to be consistent for the true variance. However, the estimator can be unstable in practice as it depends directly on the raw MCMC output. A measure of accuracy of the batch means estimator itself, ideally in the form of a confidence interval, is therefore desirable. The asymptotic variance of the batch means estimator is known; however, without any knowledge of the asymptotic distribution, the asymptotic variance alone is in general insufficient to describe variability. In this article we prove a central limit theorem for the batch means estimator that allows for the construction of asymptotically accurate confidence intervals for the batch means estimator. Additionally, our results provide a Markov chain Monte Carlo analogue of the classical CLT for the sample variance parameter for i.i.d. observations. Our result assumes standard regularity conditions similar to the ones assumed in the literature for proving consistency. Simulated and real data examples are included as illustrations and applications of the CLT.

MSC 2010 subject classifications: Primary 60J22; secondary 62F15.
Keywords and phrases: MCMC variance, batch means estimator, asymptotic normality.

1. Introduction

Markov chain Monte Carlo (MCMC) techniques are indispensable tools of modern-day computation. Routinely used in Bayesian analysis and machine learning, a major application of MCMC lies in the approximation of intractable and often high-dimensional integrals. To elaborate, let $(\mathcal X, \mathcal F, \nu)$ be an arbitrary measure space and let $\Pi$ be a probability measure on $\mathcal X$, with associated density $\pi(\cdot)$ with respect to $\nu$. The quantity

of interest is the integral
\[
\pi f = E_\pi f := \int_{\mathcal X} f(x)\, d\Pi(x) = \int_{\mathcal X} f(x)\, \pi(x)\, \nu(dx),
\]

where $f$ is a real-valued, $\Pi$-integrable function on $\mathcal X$. In many modern applications, such an integral is often intractable, i.e., (a) it does not have a closed form, (b) deterministic approximations are inefficient, often due to the high dimensionality of $\mathcal X$, and (c) it cannot be estimated via classical or i.i.d. Monte Carlo techniques, as i.i.d. random generation from $\Pi$ is in general infeasible. Markov chain Monte Carlo (MCMC) techniques are the go-to method of approximation for such integrals. Here, a Markov chain $(X_n)_{n\ge 1}$ with invariant distribution $\Pi$ [see, e.g., 22, for definitions] is generated using some MCMC sampling technique such as the Gibbs sampler or the Metropolis--Hastings algorithm. Then, ergodic averages $\bar f_n := n^{-1}\sum_{i=1}^{n} f(X_i)$ based on realizations of the Markov chain $(X_n)_{n\ge 1}$ are used as approximations of $E_\pi f$. Measuring the error incurred in such approximations is a critical step. It is well known that when a Markov chain is Harris ergodic (i.e., aperiodic, $\varphi$-irreducible, and Harris recurrent [see 22, for definitions]), then ergodic averages based on realizations of the Markov chain always furnish strongly consistent estimates of the corresponding population quantities [22, Theorem 13.0.1]. In other words, if a

Harris ergodic chain is run long enough, then the estimate $\bar f_n$ is always guaranteed to provide a reasonable approximation to the otherwise intractable quantity $E_\pi f$ (under some mild regularity conditions on $f$). Determining an MCMC sample (or iteration) size $n$ that justifies this convergence, however, requires a measure of accuracy. Similar to i.i.d. Monte Carlo estimation, the standard error of $\bar f_n$ obtained from the MCMC central limit theorem (MCMC CLT) is the natural quantity to use for this purpose. The MCMC CLT

requires additional regularity conditions as compared to its i.i.d. counterpart: if the Markov chain $(X_n)_{n\ge 1}$ is geometrically ergodic (see, e.g., Meyn and Tweedie [22] for definitions), and if $E_\pi|f|^{2+\delta} < \infty$ for some $\delta > 0$ (or $E_\pi f^2 < \infty$ if $(X_n)_{n\ge 1}$ is geometrically ergodic and reversible), it can be shown that, as $n \to \infty$,
\[
\sqrt{n}\left(\bar f_n - E_\pi f\right) \xrightarrow{d} N\big(0, \sigma_f^2\big),
\]
where $\sigma_f^2$ is the MCMC variance defined as
\[
\sigma_f^2 = \mathrm{var}_\pi f(X_1) + 2\sum_{i=2}^{\infty} \mathrm{cov}_\pi\big(f(X_1), f(X_i)\big). \tag{1.1}
\]

Here $\mathrm{var}_\pi$ and $\mathrm{cov}_\pi$ respectively denote the variance and (auto-)covariance computed under the stationary distribution $\Pi$. Note that other sufficient conditions ensuring the above central limit theorem also exist; see the survey articles of Jones et al. [16] and Roberts and Rosenthal [32] for more details. When the regularity conditions hold, a natural measure of accuracy for $\bar f_n$ is therefore given by the MCMC standard error (MCMCSE), defined as $\sigma_f/\sqrt{n}$. Note that the MCMCSE, alongside measuring the error in approximation, also helps determine the optimum iteration size $n$ required to achieve a pre-specified

level of precision, thus providing a stopping rule for terminating MCMC sampling. A related use of $\sigma_f^2$ also lies in the computation of the effective sample size $\mathrm{ESS} = n\,\mathrm{var}_\pi f(X_1)/\sigma_f^2$ [18, 29]. The ESS measures how $n$ dependent MCMC samples compare to $n$ i.i.d. observations from $\Pi$, thus providing a univariate measure of the quality of the MCMC samples. To summarize, the MCMC variance $\sigma_f^2$ facilitates the computation of three crucial aspects of an MCMC implementation, namely (a) a stopping rule for terminating simulation,

(b) the effective sample size (ESS) of the MCMC draws, and (c) the precision of the MCMC estimate $\bar f_n$.

In most non-trivial applications, however, the MCMC variance $\sigma_f^2$ is usually unknown and must be estimated. A substantial literature has been devoted to the estimation of $\sigma_f^2$ [see, e.g., 3, 9, 12, 13, 14, 23, 31, 10, 11, to name a few], and several methods, such as regenerative sampling, spectral variance estimation, and overlapping and non-overlapping batch means estimation, have been developed. In this paper, we focus on the non-overlapping batch means estimator, henceforth called the batch means estimator for simplicity, where estimation of $\sigma_f^2$ is performed by breaking the $n = a_n b_n$ Markov chain iterations into $a_n$ non-overlapping blocks or batches of equal size $b_n$. Then, for each $k \in \{1, 2, \cdots, a_n\}$, one calculates the $k$-th batch mean $\bar Z_k := b_n^{-1}\sum_{i=1}^{b_n} Z_{(k-1)b_n + i}$, and the overall mean $\bar Z := a_n^{-1}\sum_{k=1}^{a_n} \bar Z_k$, where $Z_i = f(X_i)$ for $i = 1, 2, \ldots$, and finally estimates $\sigma_f^2$ by
\[
\hat\sigma^2_{BM,f} = \hat\sigma^2_{BM,f}(n, a_n, b_n) = \frac{b_n}{a_n - 1}\sum_{k=1}^{a_n}\left(\bar Z_k - \bar Z\right)^2. \tag{1.2}
\]
The batch means estimator is straightforward to implement, and can be computed post-hoc without making any changes to the original MCMC algorithm, as opposed to some other methods, such as regenerative sampling. Under various sets of regularity conditions, the batch means estimator $\hat\sigma^2_{BM,f}$ has been shown to be strongly consistent [7, 15, 17, 11] and also mean squared consistent [5, 11] for $\sigma_f^2$, provided the batch

size $b_n$ and the number of batches $a_n$ both increase with $n$. Note that the estimator depends on the choice of the batch size $b_n$ (and hence the number of batches $a_n = n/b_n$). Optimal selection of the batch size is still an open problem, and both $b_n = n^{1/2}$ and $b_n = n^{1/3}$ have been deemed desirable in the literature; the former ensures that the batch means $\{\bar Z_k\}$ approach asymptotic normality at the fastest rate (under certain regularity conditions, [6]), and the latter minimizes the asymptotic mean-squared error of $\hat\sigma^2_{BM,f}$ (under different regularity conditions, [34]).
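The estimator in (1.2) is simple enough to compute directly from raw MCMC output. As a minimal illustration (our own sketch in R, the language used for the computations in Section 3, and not code from the paper), the hypothetical helper below takes a vector fx of $f$ evaluated at the $n$ retained draws and a batch size $b_n$, and returns $\hat\sigma^2_{BM,f}(n, a_n, b_n)$:

```r
# Batch means estimate of the MCMC variance sigma_f^2, as in (1.2).
# fx: numeric vector of f evaluated at the (post burn-in) MCMC draws
# b : batch size b_n; floor(sqrt(n)) is one of the choices discussed above
bm_var <- function(fx, b = floor(sqrt(length(fx)))) {
  a <- floor(length(fx) / b)               # number of batches a_n
  fx <- fx[seq_len(a * b)]                 # drop an incomplete final batch so that n = a_n * b_n
  zbar <- colMeans(matrix(fx, nrow = b))   # batch means \bar Z_1, ..., \bar Z_{a_n}
  b * sum((zbar - mean(zbar))^2) / (a - 1) # (1.2)
}
```

For instance, bm_var(fx) applied to the output of either of the samplers in Section 3 gives the estimate for $b_n = \lfloor\sqrt{n}\rfloor$; other batch sizes are obtained by passing b explicitly.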

It is, however, important to recognize that consistency alone does not in general justify practical usefulness, and a measure of accuracy is always required to assess the reliability of an estimator. It is known that the asymptotic variance of the batch means estimator is given by $\mathrm{var}\,\hat\sigma^2_{BM,f} = 2\sigma_f^4/a_n + o(1/n)$ under various regularity conditions [5, 11]. However, without any knowledge of the asymptotic distribution, the asymptotic variance alone is generally insufficient for assessing the accuracy of an estimator. For example, a $\pm 2$ standard error bound does not in general guarantee more than 75% coverage, as obtained from the Chebyshev inequality, and to ensure a pre-specified (95%) coverage, a much larger interval ($\sim \pm 4.5$ standard errors) is necessary in general. This provides a strong practical motivation for determining the asymptotic distribution of the batch means estimator. To the best of our knowledge, however, no such result is available.

The main purpose of this paper is to establish a central limit theorem that guarantees asymptotic normality of the batch means estimator under mild and standard regularity conditions (Theorem 2.1). There are two major motivations for our work. As discussed above, the first motivation lies in the immediate practical implication of this work. As a consequence of the CLT, the use of approximate normal confidence intervals for measuring the accuracy of batch means estimators is justified. Given MCMC samples, such intervals can be computed alongside the batch means estimator at virtually no additional cost, and therefore could be of great practical relevance. The second major motivation comes from a theoretical point of view. Although a central limit theorem for the sample variance of an i.i.d. Monte Carlo estimate is known (it can be easily established via the delta method, for example), no Markov chain Monte Carlo analogue of this result is available. Our paper provides an answer to this yet-to-be-addressed theoretical question. The proof is quite involved and leverages operator theory and the martingale central limit theorem [see, e.g., 1], as opposed to the Brownian motion based approach adopted in [11], and the result is analogous to the classical CLT for the sample variance in the i.i.d. Monte Carlo case.

The remainder of this article is organized as follows. In Section 2 we state and prove the main central limit theorem along with a few intermediate results. Section 3 provides two illustrations of the CLT – one based on a toy example (Section 3.1), and one based on a real world example (Section 3.2). Proofs of some key propositions and intermediate results are provided in the Appendix.

2. A Central Limit Theorem for the Batch Means Estimator

This section provides our main result, namely, a central limit theorem for the non-overlapping batch means standard error estimator. Before stating the theorem, we fix our notation and review some known results on Markov chains. Let $(X_n)_{n\ge 1}$ be a Markov chain on $(\mathcal X, \mathcal F, \nu)$ with Markov transition density $k(\cdot,\cdot)$ and stationary measure $\Pi$ (with density $\pi$). We denote by $K(\cdot,\cdot)$ the Markov transition function of $(X_n)_{n\ge 1}$; in particular, for $x \in \mathcal X$ and a Borel set $A \subseteq \mathcal X$, $K(x, A) = \int_A k(x, x')\, dx'$. For $m \ge 1$, the associated $m$-step Markov transition function is defined in the following inductive fashion:
\[
K^m(x, A) = \int_{\mathcal X} K^{m-1}(x', A)\, K(x, dx') = \Pr(X_{m+j} \in A \mid X_j = x)
\]
for any $j = 0, 1, \ldots$, with $K^1 \equiv K$. The Markov chain $(X_n)_{n\ge 1}$ is said to be reversible if for any $x, x' \in \mathcal X$ the detailed balance condition
\[
\pi(x)\, K(x, dx') = \pi(x')\, K(x', dx)
\]
is satisfied. Also, the chain $(X_n)_{n\ge 1}$ is said to be geometrically ergodic if there exist a constant $\kappa \in [0, 1)$ and a function $Q : \mathcal X \to [0, \infty)$ such that for any $x \in \mathcal X$ and any $m \in \{1, 2, \ldots\}$
\[
\big\|K^m(x, \cdot) - \Pi(\cdot)\big\| := \sup_{A \subseteq \mathcal X}\big|K^m(x, A) - \Pi(A)\big| \le Q(x)\,\kappa^m.
\]

Let us denote by
\[
L_0^2(\pi) = \left\{ g : \mathcal X \to \mathbb R \;:\; E_\pi g = \int_{\mathcal X} g(x)\, d\Pi(x) = 0 \ \text{ and }\ E_\pi g^2 = \int_{\mathcal X} g(x)^2\, d\Pi(x) < \infty \right\}.
\]
This is a Hilbert space where the inner product of $g, h \in L_0^2(\pi)$ is defined as
\[
\langle g, h\rangle_\pi = \int_{\mathcal X} g(x)\, h(x)\, d\Pi(x) = \int_{\mathcal X} g(x)\, h(x)\, \pi(x)\, d\nu(x)
\]
and the corresponding norm is defined by $\|g\|_\pi = \sqrt{\langle g, g\rangle_\pi}$. The Markov transition function $K(\cdot,\cdot)$ determines a Markov operator; we shall slightly abuse notation and denote the associated operator by $K$ as well. More specifically, we shall let $K : L_0^2(\pi) \to L_0^2(\pi)$ denote the operator that maps $g \in L_0^2(\pi)$ to $(Kg)(x) = \int_{\mathcal X} g(x')\, K(x, dx')$. The operator norm of $K$ is defined as $\|K\| = \sup_{g \in L_0^2(\pi):\, \|g\|_\pi = 1} \|Kg\|_\pi$. It follows that $\|K\| \le 1$. Roberts and Rosenthal [30] show that for reversible (self-adjoint) $K$, $\|K\| < 1$ if and only if the associated Markov chain $(X_n)_{n\ge 1}$ is geometrically ergodic. The following theorem establishes a CLT for the batch means estimator of the MCMC variance.

Theorem 2.1. Suppose $(X_n)_{n\ge 1}$ is a stationary, geometrically ergodic, reversible Markov chain with state space $\mathcal X$ and invariant distribution $\Pi$. Let $f : \mathcal X \to \mathbb R$ be a Borel function with $E_\pi(f^8) < \infty$. Consider the batch means estimator $\hat\sigma^2_{BM,f} = \hat\sigma^2_{BM,f}(n, a_n, b_n)$ of the MCMC variance $\sigma_f^2$ as defined in (1.2). Let $a_n$ and $b_n$ be such that $a_n \to \infty$, $b_n \to \infty$ and $\sqrt{a_n}/b_n \to 0$ as $n \to \infty$. Then
\[
\sqrt{a_n}\left(\hat\sigma^2_{BM,f}(n, a_n, b_n) - \sigma_f^2\right) \xrightarrow{d} N\big(0,\, 2\sigma_f^4\big),
\]
where $\sigma_f^2$ is the MCMC variance as defined in (1.1).
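A direct practical consequence, used in Section 3.1, is the approximate $100(1-\alpha)\%$ confidence interval for $\sigma_f^2$ implied by the theorem (after replacing $\sigma_f^4$ in the limiting variance by its plug-in estimate):
\[
\hat\sigma^2_{BM,f}(n, a_n, b_n)\ \pm\ z_{\alpha/2}\,\sqrt{\tfrac{2}{a_n}}\ \hat\sigma^2_{BM,f}(n, a_n, b_n),
\]
where $z_{\alpha/2}$ is the upper $\alpha/2$ standard normal quantile; with $\alpha = 0.05$ this is the $\pm 1.96\sqrt{2/a_n}\,\hat\sigma^2_{BM,f}$ interval whose frequentist coverage is examined in Section 3.1.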

Remark 2.1 (Proof technique). Our proof is based on an operator theoretic approach, and relies on a careful manipulation of appropriate moments and the martingale CLT. Previous work in [5, 7, 11] on consistency of $\hat\sigma^2_{BM,f}$ is based on a Brownian motion based approximation (see [11, Equation ??]). This leads to some differences in the assumptions that are required to prove the respective results. Note again that [5, 7, 11] do not explore a CLT for the batch means estimator.

Remark 2.2 (Discussion of assumptions: Uniform vs. geometric ergodicity, reversibility and moments). Our results require geometric ergodicity of the Markov chain, which in general is required to guarantee a CLT for the MCMC estimate $\bar f_n$ itself. The consistency of $\hat\sigma^2_{BM,f}$ in [5] and [7] has been proved under uniform ergodicity of the Markov chain, which is substantially more restrictive and difficult to justify in practice. On the other hand, [11] consider a Brownian motion based approach to prove their result. The consistency result in [11] holds under geometric ergodicity; however, verifying a crucial Brownian motion based sufficient condition can be challenging when the chain is not uniformly ergodic. On the other hand, we require reversibility of the Markov chain, which is not a requirement in [5, 7, 11]. Note that the commonly used Metropolis--Hastings algorithm and its modern efficient extension, the Hamiltonian Monte Carlo algorithm, are necessarily reversible [12, 24]. Also, for any Gibbs sampler, a reversible counterpart can always be constructed through random scans or reversible fixed scans [2, 12], and a two-block Gibbs sampler is always reversible. We require the function $f$ to have a finite eighth moment, while the consistency results in [7] assume the existence of the twelfth moment and those in [11] assume moments of order $4 + \delta + \epsilon$ for some $\delta > 0$ and $\epsilon > 0$. Note again that the authors in [11] do not establish a CLT.

Remark 2.3 (Stationarity). It is to be noted that Theorem 2.1 assumes stationarity, i.e., the initial measure of the Markov chain is assumed to be the stationary measure. This is similar to the assumptions made in [7, 6] for establishing consistency. A moderate burn-in or warm-up period for an MCMC algorithm is usually enough to ensure approximate stationarity in practice.

Remark 2.4 (Choice of $a_n$ and $b_n$). Consider the two practically recommended choices [10]: (i) $a_n = b_n = \sqrt{n}$, and (ii) $b_n = n^{1/3}$ (so that $a_n = n^{2/3}$), as mentioned in the Introduction. Clearly, (i) satisfies the sufficient conditions on $a_n$ and $b_n$ described in Theorem 2.1, and hence batch means estimators based on this choice attain a CLT, provided the other conditions in Theorem 2.1 hold. On the other hand, (ii) does not satisfy the conditions in Theorem 2.1, and hence a CLT is not guaranteed with this choice. Small adjustments, such as $a_n = n^{2/3 - \delta}$, $b_n = n^{1/3 + \delta}$ for some small $0 < \delta < 2/3$, or $a_n = n^{2/3}(\log n)^{-\delta}$ and $b_n = n^{1/3}(\log n)^{\delta}$ for some (small) $\delta > 0$, could be used to technically satisfy the sufficient condition; however, the resulting convergence in distribution may be slow (see the toy example in Section 3.1).
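For concreteness, the condition $\sqrt{a_n}/b_n \to 0$ can be checked in one line for the two recommended choices (this verification is ours, not part of the original remark):
\[
\text{(i)}\ \ \frac{\sqrt{a_n}}{b_n} = \frac{n^{1/4}}{n^{1/2}} = n^{-1/4} \to 0,
\qquad\qquad
\text{(ii)}\ \ \frac{\sqrt{a_n}}{b_n} = \frac{n^{1/3}}{n^{1/3}} = 1 \not\to 0.
\]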

Before proving Theorem 2.1, we first introduce some notation, and then state and prove some intermediate

results. Suppose the Markov chain (Xn)n≥1 and the function f satisfy the assumptions made in Theorem 2.1.

Define $Y_i = f(X_i) - E_\pi f$ for $i = 1, 2, \ldots$, and write the batch means estimator $\hat\sigma^2_{BM,f}$ in (1.2) as
\[
\hat\sigma^2_{BM,f} = \hat\sigma^2_{BM,f}(n, a_n, b_n) = \frac{b_n}{a_n - 1}\sum_{k=1}^{a_n}\left(\bar Y_k - \bar Y\right)^2 = \frac{a_n}{a_n - 1}\left(\frac{b_n}{a_n}\sum_{k=1}^{a_n}\bar Y_k^2 - b_n\bar Y^2\right).
\]
Here $\bar Y_k := b_n^{-1}\sum_{i=1}^{b_n} Y_{(k-1)b_n + i}$, and $\bar Y := a_n^{-1}\sum_{k=1}^{a_n}\bar Y_k$. We shall consider the related quantity
\[
\tilde\sigma^2_{BM,f} := \frac{b_n}{a_n}\sum_{k=1}^{a_n}\bar Y_k^2 - b_n\bar Y^2 = \frac{a_n - 1}{a_n}\,\hat\sigma^2_{BM,f} \tag{2.1}
\]
and call it the modified batch means estimator. The following two lemmas establish two asymptotic results on the modified batch means estimator. The first lemma proves asymptotic normality for the modified batch means estimator (with a shift) whenever $a_n \to \infty$ and $b_n \to \infty$. Key propositions needed in the proof of this lemma are provided in the Appendix.

Lemma 2.1. Consider the modified batch means estimator $\tilde\sigma^2_{BM,f}$ as defined in (2.1). If $a_n \to \infty$ and $b_n \to \infty$ as $n \to \infty$, then
\[
\sqrt{a_n}\left[\tilde\sigma^2_{BM,f} - \sigma_f^2 - \left\{E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right\}\right] \xrightarrow{d} N\big(0, 2\sigma_f^4\big),
\]
where $\sigma_f^2$ is the MCMC variance as defined in (1.1).

Proof. First observe that
\begin{align*}
&\sqrt{a_n}\left[\tilde\sigma^2_{BM,f} - \sigma_f^2 - \left\{E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right\}\right]\\
&= \sqrt{a_n}\left[\tilde\sigma^2_{BM,f} - E_\pi\big(\tilde\sigma^2_{BM,f}\big) - E_\pi\big(b_n\bar Y^2\big)\right]\\
&= \sqrt{a_n}\left[\frac{b_n}{a_n}\sum_{k=1}^{a_n}\bar Y_k^2 - b_n\bar Y^2 - E_\pi\Big(\frac{b_n}{a_n}\sum_{k=1}^{a_n}\bar Y_k^2\Big)\right]\\
&= \frac{b_n}{\sqrt{a_n}}\sum_{k=1}^{a_n}\left[\bar Y_k^2 - E_\pi\big(\bar Y_k^2\big)\right] - \sqrt{a_n}\, b_n\bar Y^2\\
&= \frac{b_n}{\sqrt{a_n}}\sum_{k=2}^{a_n}\left\{\bar Y_k^2 - E\big(\bar Y_k^2 \mid \mathcal F_{k-1,n}\big) + E\big(\bar Y_k^2 \mid \mathcal F_{k-1,n}\big) - E_\pi\big(\bar Y_k^2\big)\right\} + \frac{b_n}{\sqrt{a_n}}\left[\bar Y_1^2 - E_\pi\big(\bar Y_1^2\big)\right] - \sqrt{a_n}\, b_n\bar Y^2\\
&= \frac{b_n}{\sqrt{a_n}}\sum_{k=2}^{a_n}\left\{\bar Y_k^2 - E\big(\bar Y_k^2 \mid \mathcal F_{k-1,n}\big) + h\big(X_{(k-1)b_n}\big) - E_\pi h\big(X_{(k-1)b_n}\big)\right\} + \frac{b_n}{\sqrt{a_n}}\left[\bar Y_1^2 - E_\pi\big(\bar Y_1^2\big)\right] - \sqrt{a_n}\, b_n\bar Y^2. \tag{2.2}
\end{align*}
Here, for $1 \le k \le a_n$, $\mathcal F_{k,n}$ is the sigma-algebra generated by $X_1, \ldots, X_{kb_n}$, and
\[
h\big(X_{(k-1)b_n}\big) := E\big(\bar Y_k^2 \mid \mathcal F_{k-1,n}\big) = E\big(\bar Y_k^2 \mid X_{(k-1)b_n}\big),
\]
due to the Markovian structure of $(X_n)_{n\ge 1}$. Let $\tilde h = h - E_\pi h \in L_0^2(\pi)$. Since the Markov operator $K$ has operator norm $\lambda = \|K\| < 1$ (due to geometric ergodicity), it follows that $I - K$ is invertible (using, e.g., the expansion $(I - K)^{-1} = \sum_{j=0}^{\infty} K^j$). Therefore, $I - K^{b_n}$ is also invertible, since $K^{b_n}$ is also a Markov operator. Consequently, one can find a $\tilde g$ such that $\tilde g = (I - K^{b_n})^{-1}\tilde h$, i.e., $\tilde h = \tilde g - K^{b_n}\tilde g$. Then
\begin{align*}
h\big(X_{(k-1)b_n}\big) - E_\pi h\big(X_{(k-1)b_n}\big) &= \tilde h\big(X_{(k-1)b_n}\big)\\
&= \tilde g\big(X_{(k-1)b_n}\big) - \big(K^{b_n}\tilde g\big)\big(X_{(k-1)b_n}\big)\\
&= \left[\tilde g\big(X_{(k-1)b_n}\big) - \tilde g\big(X_{kb_n}\big)\right] + \left[\tilde g\big(X_{kb_n}\big) - \big(K^{b_n}\tilde g\big)\big(X_{(k-1)b_n}\big)\right].
\end{align*}

Hence
\[
\sum_{k=2}^{a_n}\left[h\big(X_{(k-1)b_n}\big) - E_\pi h\big(X_{(k-1)b_n}\big)\right]
= \sum_{k=2}^{a_n}\left[\tilde g\big(X_{kb_n}\big) - E\big(\tilde g(X_{kb_n}) \mid X_{(k-1)b_n}\big)\right] + \tilde g\big(X_{b_n}\big) - \tilde g\big(X_{a_n b_n}\big),
\]
so that from (2.2),
\begin{align*}
&\sqrt{a_n}\left[\tilde\sigma^2_{BM,f} - E_\pi\big(\tilde\sigma^2_{BM,f}\big) - E_\pi\big(b_n\bar Y^2\big)\right]\\
&= \frac{b_n}{\sqrt{a_n}}\sum_{k=2}^{a_n}\left[\bar Y_k^2 - E\big(\bar Y_k^2 \mid X_{(k-1)b_n}\big) + \tilde g\big(X_{kb_n}\big) - E\big(\tilde g(X_{kb_n}) \mid X_{(k-1)b_n}\big)\right]\\
&\qquad + \frac{b_n}{\sqrt{a_n}}\left[\tilde g\big(X_{b_n}\big) - \tilde g\big(X_{a_n b_n}\big)\right] + \frac{b_n}{\sqrt{a_n}}\left[\bar Y_1^2 - E_\pi\big(\bar Y_1^2\big)\right] - \sqrt{a_n}\, b_n\bar Y^2\\
&= T_1 + T_2 + T_3 - T_4, \quad\text{say}. \tag{2.3}
\end{align*}
We shall note the convergence of the terms $T_1$, $T_2$, $T_3$ and $T_4$ separately. From the Markov chain CLT, we have $\sqrt{n}\,\bar Y = \sqrt{a_n b_n}\,\bar Y \xrightarrow{d} N(0, \sigma_f^2)$. Therefore, $(\sqrt{n}\,\bar Y)^2 = a_n b_n\bar Y^2 = O_P(1)$, which means
\[
T_4 = \sqrt{a_n}\, b_n\bar Y^2 = \frac{1}{\sqrt{a_n}}\cdot a_n b_n\bar Y^2 = o_P(1).
\]
Again, for all $1 \le k \le a_n$,
\begin{align*}
\big\|b_n\,\tilde g(X_{kb_n})\big\|_\pi^2 &= \Big\|b_n\big(I - K^{b_n}\big)^{-1}\tilde h(X_{kb_n})\Big\|_\pi^2
= \Big\|b_n\Big(\sum_{j=0}^{\infty} K^{b_n j}\Big)\tilde h(X_{kb_n})\Big\|_\pi^2\\
&\le b_n^2\Big(\sum_{j=0}^{\infty}\|K\|^{b_n j}\Big)^2\big\|\tilde h(X_{kb_n})\big\|_\pi^2
= \left(\frac{1}{1-\lambda^{b_n}}\right)^2\mathrm{var}_\pi\!\left[E\big(b_n\bar Y_k^2 \mid X_{(k-1)b_n}\big)\right]\\
&\le \left(\frac{1}{1-\lambda^{b_n}}\right)^2 E_\pi\big(b_n^2\,\bar Y_k^4\big) \to 3\sigma_f^4,
\end{align*}
since $\sum_{j=0}^{\infty}\|K\|^{b_n j} = (1-\lambda^{b_n})^{-1} \to 1$ as $\lambda = \|K\| \in (0, 1)$, and $E_\pi\big(b_n^2\bar Y_k^4\big) \to 3\sigma_f^4$ from Proposition A.2. Consequently, $b_n\tilde g(X_{kb_n}) = O_P(1)$ and hence
\[
T_2 = \frac{b_n}{\sqrt{a_n}}\left[\tilde g\big(X_{b_n}\big) - \tilde g\big(X_{a_n b_n}\big)\right] = o_P(1).
\]
Again using the Markov chain CLT for $\bar Y_1$, it follows that
\[
T_3 = \frac{b_n}{\sqrt{a_n}}\left[\bar Y_1^2 - E_\pi\big(\bar Y_1^2\big)\right] = o_P(1).
\]

Finally, note that the terms inside the summation sign in $T_1$, i.e.,
\[
\zeta_{k,n} = \bar Y_k^2 - E\big(\bar Y_k^2 \mid X_{(k-1)b_n}\big) + \tilde g\big(X_{kb_n}\big) - E\big(\tilde g(X_{kb_n}) \mid X_{(k-1)b_n}\big),
\]
form a martingale difference sequence (MDS) for $k \ge 2$. Let
\[
\xi_{k,n} = \zeta_{k,n}\Big/\sqrt{E_\pi\big(\zeta_{k,n}^2\big)}. \tag{2.4}
\]
Of course $E_\pi\,\xi_{k,n} = 0$, $\mathrm{var}_\pi\,\xi_{k,n} = 1$ and $E_\pi|\xi_{k,n}|^{2+\delta} < \infty$, e.g., for $\delta = 1$, as $E_\pi(f^8) < \infty$ by assumption. Then, for each $n \ge 1$, $(\xi_{k,n})_{k\ge 2}$ is a mean 0 and variance 1 MDS with $(a_n - 1)^{-1}\sum_{k=2}^{a_n} E\big(\xi_{k,n}^2 \mid \mathcal F_{k-1,n}\big) \xrightarrow{P} 1$ (Proposition A.1 in Appendix A). Therefore,
\[
\frac{1}{\sqrt{a_n - 1}}\sum_{k=2}^{a_n}\xi_{k,n} \xrightarrow{d} N(0, 1)
\]
as $n \to \infty$, by the Lyapunov CLT for MDS [1, Theorem 1.3]. Hence,
\[
T_1 = \frac{b_n}{\sqrt{a_n}}\sum_{k=2}^{a_n}\zeta_{k,n} = \frac{b_n}{\sqrt{a_n}}\sum_{k=2}^{a_n}\tau_n\,\xi_{k,n} = b_n\tau_n\,\frac{1}{\sqrt{a_n}}\sum_{k=2}^{a_n}\xi_{k,n} \xrightarrow{d} N(0, c^2)
\]
as long as $b_n^2\tau_n^2 \to c^2$ as $n \to \infty$ for some $c > 0$, where $\tau_n^2 = E_\pi\big(\zeta_{k,n}^2\big)$. Now,
\[
b_n^2\tau_n^2 = E_\pi\left[b_n\bar Y_1^2 - E\big(b_n\bar Y_1^2 \mid X_0\big) + b_n\tilde g\big(X_{b_n}\big) - E\big(b_n\tilde g(X_{b_n}) \mid X_0\big)\right]^2 = E_\pi\left[(U_n + V_n)^2\right],
\]
where
\[
U_n = b_n\bar Y_1^2 - E\big(b_n\bar Y_1^2 \mid X_0\big) + b_n\tilde h\big(X_{b_n}\big) \tag{2.5}
\]
and
\[
V_n = b_n\tilde g\big(X_{b_n}\big) - b_n\tilde h\big(X_{b_n}\big) - E\big(b_n\tilde g(X_{b_n}) \mid X_0\big). \tag{2.6}
\]
From Propositions A.4 and A.5 in Appendix A, it follows that $E_\pi(U_n^2) \to 2\sigma_f^4$ and $E_\pi(V_n^2) \to 0$ as $n \to \infty$, where $\sigma_f^2$ is the MCMC variance (1.1). Therefore, by Schwarz's inequality, $0 \le \{E_\pi(U_n V_n)\}^2 \le E_\pi(U_n^2)\,E_\pi(V_n^2) \to 0$, i.e., $E_\pi(U_n V_n) \to 0$, and hence
\[
b_n^2\tau_n^2 = E_\pi\left[(U_n + V_n)^2\right] = E_\pi\big(U_n^2 + V_n^2 + 2U_n V_n\big) \to 2\sigma_f^4.
\]
Consequently, $T_1 \xrightarrow{d} N(0, 2\sigma_f^4)$. Using this in (2.3), together with the fact that each of $T_2$, $T_3$ and $T_4$ is $o_P(1)$, completes the proof.

We now state and prove our second lemma. This lemma shows that the shift in Lemma 2.1 is asymptotically negligible if $a_n$ is of a smaller order than $n^{2/3}$. On the other hand, if $a_n$ is of a larger order than $n^{2/3}$, and $K$ is a positive operator ($\langle g, Kg\rangle_\pi \ge 0$ for all $g \in L_0^2(\pi)$), then the shift diverges to infinity asymptotically.

Lemma 2.2. Consider the modified batch means estimator $\tilde\sigma^2_{BM,f}$ as defined in (2.1). As $n \to \infty$, we have:

(i) $\sqrt{a_n}\left|E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right| \to 0$ if $\sqrt{a_n}/b_n \to 0$;

(ii) in addition, if the Markov operator $K$ associated with $(X_n)_{n\ge 1}$ is positive, self-adjoint, and $K(f - E_\pi f) \not\equiv 0$, then $\sqrt{a_n}\left|E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right| \to \infty$ if $\sqrt{a_n}/b_n \to \infty$.

Proof. At the outset, note that
\[
\sqrt{a_n}\left[E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right]
= \sqrt{a_n}\left[E_\pi\Big(\frac{b_n}{a_n}\sum_{k=1}^{a_n}\bar Y_k^2\Big) - \sigma_f^2\right]
= \sqrt{a_n}\left[b_n E_\pi\big(\bar Y_1^2\big) - \sigma_f^2\right], \tag{2.7}
\]
where $\sigma_f^2$ is the MCMC variance defined in (1.1). Now
\[
b_n E_\pi\big(\bar Y_1^2\big) = \frac{1}{b_n}E_\pi\big(Y_1 + Y_2 + \cdots + Y_{b_n}\big)^2 = \frac{1}{b_n}\left(b_n\gamma_0 + 2\sum_{k=1}^{b_n-1}(b_n - k)\gamma_k\right),
\]
and from (1.1), $\sigma_f^2 = \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k$, where for any $h \ge 0$, $\gamma_h$ denotes the auto-covariance
\[
\gamma_h = \mathrm{cov}_\pi(Y_1, Y_{1+h}) = E_\pi(Y_1 Y_{1+h}) = E_\pi\big[Y_1\, E(Y_{1+h} \mid X_1)\big] = \langle f_0, K^h f_0\rangle_\pi. \tag{2.8}
\]
Here $f_0 = f - E_\pi f \in L_0^2(\pi)$, $K^0 \equiv I$ (the identity operator), and $K^h$ for $h \ge 1$ denotes the operator associated with the $h$-step Markov transition function. Therefore, from (2.7), it follows that
\begin{align*}
\sqrt{a_n}\left[E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right]
&= \sqrt{a_n}\left[\frac{1}{b_n}\Big(b_n\gamma_0 + 2\sum_{k=1}^{b_n-1}(b_n - k)\gamma_k\Big) - \Big(\gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k\Big)\right]\\
&= \frac{\sqrt{a_n}}{b_n}\left[b_n\gamma_0 + 2\sum_{k=1}^{b_n-1}(b_n - k)\gamma_k - b_n\gamma_0 - 2b_n\sum_{k=1}^{\infty}\gamma_k\right]\\
&= -\frac{2\sqrt{a_n}}{b_n}\left[\sum_{k=1}^{b_n-1}k\gamma_k + b_n\sum_{k=b_n}^{\infty}\gamma_k\right]. \tag{2.9}
\end{align*}
Using the triangle inequality on the right hand side of (2.9), we get
\begin{align*}
\sqrt{a_n}\left|E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right|
&\le \frac{2\sqrt{a_n}}{b_n}\left(\sum_{k=1}^{b_n-1}k|\gamma_k| + b_n\sum_{k=b_n}^{\infty}|\gamma_k|\right)\\
&\overset{(\star)}{\le} \frac{2\sqrt{a_n}}{b_n}\,\|f_0\|_\pi^2\left(\sum_{k=1}^{b_n-1}k\lambda^k + b_n\sum_{k=b_n}^{\infty}\lambda^k\right)\\
&\le \frac{2\sqrt{a_n}}{b_n}\,\|f_0\|_\pi^2\sum_{k=1}^{\infty}k\lambda^k = \frac{2\sqrt{a_n}}{b_n}\cdot\frac{\lambda\,\|f_0\|_\pi^2}{(1-\lambda)^2}. \tag{2.10}
\end{align*}
It follows that $\sqrt{a_n}\left|E_\pi(\tilde\sigma^2_{BM,f}) + E_\pi(b_n\bar Y^2) - \sigma_f^2\right| \to 0$ if $\sqrt{a_n}/b_n \to 0$ as $n \to \infty$. Here $\lambda = \|K\| < 1$ (as the chain is geometrically ergodic), and $(\star)$ follows from the fact that $|\gamma_h| = |\langle f_0, K^h f_0\rangle_\pi| \le \|K\|^h\|f_0\|_\pi^2 = \lambda^h\|f_0\|_\pi^2$. This proves (i).

As for (ii), note that if $K$ is a positive operator, then $\gamma_h = \langle f_0, K^h f_0\rangle_\pi \ge 0$ for all $h \ge 0$. Moreover, reversibility of $(X_n)_{n\ge 1}$ implies $\gamma_2 = \langle f_0, K^2 f_0\rangle_\pi = \langle Kf_0, Kf_0\rangle_\pi = \|Kf_0\|_\pi^2 > 0$ (since $Kf_0 \not\equiv 0$ by assumption). Consequently, the sum inside the absolute value on the right hand side of (2.9) is bounded below by $2\gamma_2 > 0$. As such,
\[
\sqrt{a_n}\left|E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right| \ge 4\,\frac{\sqrt{a_n}}{b_n}\,\gamma_2. \tag{2.11}
\]
It follows that $\sqrt{a_n}\left|E_\pi(\tilde\sigma^2_{BM,f}) + E_\pi(b_n\bar Y^2) - \sigma_f^2\right| \to \infty$ if $\sqrt{a_n}/b_n \to \infty$ as $n \to \infty$. This proves (ii).

With Lemmas 2.1 and 2.2 proved, we are now in a position to formally prove Theorem 2.1, which is essentially a combination of these two lemmas and the fact that the modified batch means estimator is asymptotically equivalent to the batch means estimator.

Proof of Theorem 2.1. Observe that
\begin{align*}
\sqrt{a_n}\left(\tilde\sigma^2_{BM,f} - \sigma_f^2\right)
&= \sqrt{a_n}\left[E_\pi\big(\tilde\sigma^2_{BM,f}\big) + E_\pi\big(b_n\bar Y^2\big) - \sigma_f^2\right]
+ \sqrt{a_n}\left[\tilde\sigma^2_{BM,f} - E_\pi\big(\tilde\sigma^2_{BM,f}\big) - E_\pi\big(b_n\bar Y^2\big)\right]\\
&\xrightarrow{d} N\big(0, 2\sigma_f^4\big),
\end{align*}
from Lemma 2.2, Lemma 2.1 and Slutsky's theorem. Therefore,
\begin{align*}
\sqrt{a_n}\left(\hat\sigma^2_{BM,f} - \sigma_f^2\right)
&= \sqrt{a_n}\left(\frac{a_n}{a_n - 1}\,\tilde\sigma^2_{BM,f} - \sigma_f^2\right)\\
&= \left(\frac{a_n}{a_n - 1}\right)\sqrt{a_n}\left(\tilde\sigma^2_{BM,f} - \sigma_f^2\right) + \left(\frac{\sqrt{a_n}}{a_n - 1}\right)\sigma_f^2\\
&\xrightarrow{d} N\big(0, 2\sigma_f^4\big),
\end{align*}
by another application of Slutsky's theorem. This completes the proof.

3. Illustration

This section illustrates the applicability of the central limit theorem through replicated frequentist evaluations of the batch means MCMC variance estimator. To elaborate, given a total iteration size $n + n_0$, where $n$ denotes the final MCMC iteration size and $n_0$ denotes the burn-in size, we generate replicated $(n + n_0)$-realizations of a Markov chain with different and independent random starting points, and evaluate an appropriate function $f$ at each Markov chain realization. The batch means MCMC variance estimates $\hat\sigma^2_{BM,f}(n, a_n, b_n)$ for a few different choices of $b_n$ (and $a_n = n/b_n$) are subsequently computed from each Markov chain after discarding burn-in (to ensure stationarity). This provides a frequentist sampling distribution of $\hat\sigma^2_{BM,f}(n, a_n, b_n)$ for a given iteration size $n$, batch size $b_n$ and number of batches $a_n$. The whole experiment is then repeated for increasing values of $n$ to empirically assess the limiting behavior of the corresponding sampling distributions. We consider two examples – a simulated toy example (Section 3.1) with a Markov chain for which the true (population) MCMC variance is known, and a real example (Section 3.2) with a practically useful Markov chain that aids in a high-dimensional linear regression framework. The former illustrates the validity and accuracy of the CLT while the latter illustrates the applicability of our results in real world scenarios. All computations in this section are done in R v3.4.4 [27], and the packages tidyverse [36] and flare [20] are used.

3.1. Toy example: Gibbs sampler with normal conditional distributions

In this section we consider a toy two-block normal Markov chain $(x_n, z_n)_{n\ge 0}$ with state space $\mathbb R^2$ and transitions $x \mid z \sim N(z, 1/4)$ and $z \mid x \sim N(x/2, 1/8)$. Our interest lies in the $x$-subchain, which evolves as $x_{n+1} = x_n/2 + N(0, 3/8)$. We consider the identity function $f(x) = x$, and seek to estimate the corresponding MCMC variance. The example has been considered multiple times in the literature [8, 26, 4] and many operator theoretic properties of the chain have been thoroughly examined. In particular, the eigenvalues of the associated Markov operator have been obtained as $(2^{-n})_{n\ge 0}$ [8]. This, together with reversibility of the Markov chain (the marginal chain of a two-block Gibbs sampler is always reversible, [12]), implies geometric ergodicity. It is straightforward to see that the target stationary distribution $\pi$ is the normal distribution $N(0, 1/2)$, and the $h$-th order auto-covariance for the $x$ chain, $h \ge 0$, is given by $\gamma_h = \mathrm{cov}_\pi(x_h, x_0) = \langle f - E_\pi f, K^h(f - E_\pi f)\rangle_\pi = 2^{-(1+h)}$. Consequently, the true (population) MCMC variance of the chain is given by
\[
\sigma_f^2 = \gamma_0 + 2\sum_{h=1}^{\infty}\gamma_h = \frac{1}{2} + \sum_{h=1}^{\infty}\frac{1}{2^h} = \frac{1}{2} + 1 = 1.5.
\]
To assess the asymptotic performance of the batch means estimator in this toy example, we generate 5,000 replicates of the proposed Markov chain, each with an iteration size of 520,000 and an independent standard normal starting point for $x$. In each replicate, after throwing away the initial 20,000 iterations as burn-in, we compute the batch means estimate $\hat\sigma^2_{BM,f}(n, a_n, b_n)$ for (i) $b_n = \sqrt{n}$, (ii) $b_n = n^{0.4}$ and (iii) $b_n = n^{1/3 + 10^{-5}}$, separately with the first (after burn-in) $n$ = 5,000, 10,000, 50,000, 100,000 and 500,000 iterations.
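A reader wishing to reproduce this behavior at a small scale can simulate the $x$-subchain directly; the R sketch below (ours, not the authors' replication script) generates one chain with deliberately modest sizes, discards burn-in and applies the hypothetical bm_var helper from Section 1, whose output should be close to the true value 1.5:

```r
set.seed(1)
n_burn <- 20000; n_keep <- 50000
x <- numeric(n_burn + n_keep)
x[1] <- rnorm(1)                                    # independent standard normal start
for (i in 2:length(x)) {
  x[i] <- x[i - 1] / 2 + rnorm(1, sd = sqrt(3 / 8)) # x_{n+1} = x_n / 2 + N(0, 3/8)
}
fx <- x[-(1:n_burn)]                                # f(x) = x, burn-in discarded
bm_var(fx)                                          # batch means estimate of sigma_f^2 (true value 1.5)
```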

The estimates are subsequently standardized by the population mean $\sigma_f^2 = 1.5$ and the corresponding population standard deviation $\sqrt{2/a_n}\,\sigma_f^2 = 1.5\sqrt{2/a_n}$. For each $n$, these standardized estimates from different replicates are then collected and their frequentist sampling distributions are plotted as separate histograms for the different choices of $b_n$ (blue histograms for $b_n = \sqrt{n}$, red histograms for $b_n = n^{0.4}$, and orange histograms for $b_n = n^{1/3+10^{-5}}$). These histograms, along with overlaid standard normal curves, are displayed in Figure 1.

From Figure 1, the following observations are made. First, as $n \to \infty$ the sampling distributions of the BM variance estimates appear to become more "normal", i.e., the histograms become more symmetric and bell shaped, for all choices of $b_n$. This is a direct consequence of the CLT proved in Theorem 2.1. Second, of the three choices of $b_n$ considered, the BM variance estimates associated with $b_n = \sqrt{n}$ are the least biased, followed by $b_n = n^{0.4}$, and the estimates associated with $b_n = n^{1/3+10^{-5}}$ are the most biased. This is not surprising, as (2.10) and (2.11) show that the asymptotic bias is of the same order as $\sqrt{a_n}/b_n$.

[Figure 1 about here: a 5 x 3 matrix of histograms (rows: n = 5,000, 10,000, 50,000, 100,000, 500,000; columns: $b_n = \sqrt{n}$, $b_n = n^{0.4}$, $b_n = n^{1/3+10^{-5}}$; y-axis: Density; x-axis: standardized BM variance estimate), each overlaid with the standard normal density. See the caption below.]

Fig 1. Frequentist sampling distribution of the batch means MCMC variance estimator in the toy normal example. The sampling distributions of the standardized (with mean $\sigma_f^2 = 1.5$ and standard deviation $\sqrt{2/a_n}\,\sigma_f^2 = 1.5\sqrt{2/a_n}$) batch means MCMC variance estimator $\hat\sigma^2_{BM,f}$ for the $x$-subchain obtained from 5,000 replicates are plotted as a matrix of histograms for various choices of $n$ and $b_n$. For each $n \in \{$5,000, 10,000, 50,000, 100,000, 500,000$\}$ (plotted along the vertical direction of the histogram matrix), the blue histogram (left most panel) corresponds to $b_n = \sqrt{n}$, red (middle panel) corresponds to $b_n = n^{0.4}$ and orange (right most panel) corresponds to $b_n = n^{1/3+10^{-5}}$. The overlaid black curve on each histogram corresponds to the standard normal density function.

As $n \to \infty$, the bias goes to zero, a fact that is well illustrated through the histograms for $b_n = \sqrt{n}$ (blue histograms) and $b_n = n^{0.4}$ (red histograms). For $b_n = n^{1/3+10^{-5}}$ (orange histograms) a much larger $n$ is required. Finally, to assess the practical utility of the proposed CLT, we note the frequentist empirical coverage of approximate normal confidence intervals for the true MCMC variance $\sigma_f^2$. In each replicate, for each $(n, b_n)$ pair, we first construct a 95% approximate normal confidence interval with bounds $\hat\sigma^2_{BM,f}(n, a_n, b_n) \pm 1.96\sqrt{2/a_n}\,\hat\sigma^2_{BM,f}(n, a_n, b_n)$. Then we compute the frequentist coverages of these 95% confidence intervals by evaluating the proportion of replicates where the corresponding interval contains the true $\sigma_f^2 = 1.5$, separately for each $(n, b_n)$ pair. These frequentist coverages are displayed in Table 1, which shows near perfect coverage for $b_n = \sqrt{n}$ even for moderate $n$ ($\ge$ 50,000), increasingly better coverage for $b_n = n^{0.4}$ (with moderately large $n$), and poor coverage for $b_n = n^{1/3+10^{-5}}$ even for large $n$ (= 500,000). These results are in concordance with the histograms displayed in Figure 1, and demonstrate that for the current problem $b_n = \sqrt{n}$ provides the fastest asymptotic normal convergence among the three choices of $b_n$ considered.

       n    b_n = √n    b_n = n^0.4    b_n = n^(1/3+10^-5)
   5,000      0.924       0.902            0.814
  10,000      0.927       0.907            0.810
  50,000      0.946       0.932            0.825
 100,000      0.943       0.934            0.835
 500,000      0.949       0.941            0.834
Table 1. Frequentist coverages of approximate normal 95% confidence intervals for the MCMC variance $\sigma_f^2$ based on the batch means estimator $\hat\sigma^2_{BM,f}(n, a_n, b_n)$ for various choices of $n$ and $b_n$.
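The interval construction behind Table 1 is a one-liner once the batch means estimate is available. A minimal sketch follows (our code, reusing the hypothetical bm_var helper from Section 1; replicate_list below stands for a list of replicated output vectors and is not an object from the paper):

```r
# 95% approximate normal confidence interval for sigma_f^2 from one MCMC run
bm_ci <- function(fx, b = floor(sqrt(length(fx))), level = 0.95) {
  a <- floor(length(fx) / b)                 # number of batches a_n
  est <- bm_var(fx, b)
  z <- qnorm(1 - (1 - level) / 2)            # 1.96 for level = 0.95
  c(lower = est - z * sqrt(2 / a) * est, upper = est + z * sqrt(2 / a) * est)
}

# Empirical coverage: proportion of replicates whose interval contains the truth 1.5
coverage <- function(replicate_list, truth = 1.5) {
  mean(sapply(replicate_list, function(fx) {
    ci <- bm_ci(fx); ci["lower"] <= truth && truth <= ci["upper"]
  }))
}
```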

3.2. Real data example: data augmentation Gibbs sampler for Bayesian lasso regression

This section illustrates the applicability of the proposed CLT in a real world application. Consider the linear regression model

\[
Y \mid \mu, \beta, \eta^2 \sim N_m\big(\mu + X\beta,\ \eta^2 I_m\big),
\]
where $Y \in \mathbb R^m$ is a vector of responses, $X$ is a non-stochastic $m \times p$ design matrix of standardized covariates, $\beta \in \mathbb R^p$ is a vector of unknown regression coefficients, $\eta^2 > 0$ is an unknown residual variance, $\mu \in \mathbb R$ is an unknown intercept, $N_d$ denotes the $d$-variate ($d \ge 1$) normal distribution and $I_m$ denotes the $m$-dimensional identity matrix. Interest lies in the estimation of $\beta$ and $\eta^2$. In many modern-day applications, the sample size $m$ is smaller than the number $p$ of covariates. For a meaningful estimation of $\beta$ in such a scenario, regularization (i.e., shrinkage towards zero) of the estimate is necessary. A particularly useful regularization approach involves the use of a lasso penalty [35], producing lasso estimates of the regression coefficients. The Bayesian lasso framework [25] provides a probabilistic approach to quantifying the uncertainties in lasso estimation. Here, one considers the following hierarchical prior for $\beta$:

\begin{align*}
\beta &\sim N_p\big(0,\ \eta^2 D_\tau\big),\\
\tau_j &\overset{\text{i.i.d.}}{\sim} \text{Exponential}\big(\text{rate} = \lambda^2/2\big),\quad j = 1, \ldots, p,
\end{align*}
and estimates $\beta$ through the associated posterior distribution obtained from the Bayes rule:

posterior density ∝ prior density × likelihood.

Here Dτ is the diagonal matrix Diag{τ1, . . . , τp}, and λ > 0 is a prior hyper-parameter that determines the amount of sparsity in β. Note that the marginal (obtained by integrating out τj’s) prior for β is a product of independent Laplace densities, and the associated marginal posterior mode of β corresponds to the frequentist lasso estimate of β.
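For completeness, the scale-mixture identity behind that statement (our explicit display of the standard result used in Park and Casella [25], not an equation from this paper) is
\[
\int_0^\infty \frac{1}{\sqrt{2\pi\eta^2\tau_j}}\exp\!\left(-\frac{\beta_j^2}{2\eta^2\tau_j}\right)\frac{\lambda^2}{2}\exp\!\left(-\frac{\lambda^2\tau_j}{2}\right)d\tau_j
= \frac{\lambda}{2\sqrt{\eta^2}}\exp\!\left(-\frac{\lambda|\beta_j|}{\sqrt{\eta^2}}\right),
\]
so that, conditionally on $\eta^2$, each $\beta_j$ has a Laplace (double exponential) prior with rate $\lambda/\sqrt{\eta^2}$.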

It is clear that the target posterior distribution of $\beta$, $\eta^2$ and $\tau = (\tau_1, \ldots, \tau_p)$ is intractable, i.e., it is not available in closed form, and i.i.d. random generation from the distribution is infeasible. Park and Casella [25] suggested a three-block Gibbs sampler for MCMC sampling from the target posterior, which was later shown to be geometrically ergodic [19]. A more efficient (in an operator theoretic sense) two-block version of this three-block Gibbs sampler has been recently proposed in Rajaratnam et al. [28], where the authors prove the trace-class property of the proposed algorithm, which, in particular, also implies geometric ergodicity (recall that a two-block Gibbs sampler is always reversible). One iteration of the proposed two-block Gibbs sampler consists of the following random generations.

1. Generate $(\beta, \eta^2)$ from the following conditional distributions:
\begin{align*}
\eta^2 \mid \tau, Y &\sim \text{Inverse-Gamma}\left(\frac{m + p - 1}{2},\ \frac{1}{2}\big\|\tilde Y - X\beta\big\|^2 + \frac{1}{2}\beta^T D_\tau^{-1}\beta\right),\\
\beta \mid \eta^2, \tau, Y &\sim N_p\left(A_\tau^{-1} X^T \tilde Y,\ \eta^2 A_\tau^{-1}\right).
\end{align*}
2. Independently generate $\tau_1, \ldots, \tau_p$ such that the full conditional distribution of $1/\tau_j$, $j = 1, \ldots, p$, is given by
\[
1/\tau_j \mid \beta, \eta^2, Y \sim \text{Inverse-Gaussian}\left(\sqrt{\frac{\lambda^2\eta^2}{\beta_j^2}},\ \lambda^2\right).
\]
Here $\tilde Y = Y - m^{-1}(Y^T 1_m)1_m$, $1_m$ being the $m$-component vector of 1's, and $A_\tau = X^T X + D_\tau^{-1}$.

For a real world application of the above sampler we consider the gene expression data of Scheetz et al. [33], made publicly available in the R package flare [21] as the data set entitled eyedata. The data set consists of $m = 120$ observations on a response variable (expression level) and $p = 200$ predictor variables (gene probes). Rajaratnam et al. [28] analyze this data set in the context of Bayesian lasso regression, and provide an efficient R implementation of the aforementioned two-block Gibbs sampler in their supplementary document. Following [28], we standardize the columns of the design matrix $X$ and choose the prior (sparsity) parameter as $\lambda = 0.2185$, which ensures that the frequentist lasso estimate (marginal posterior mode) of $\beta$ has $\min\{m, p\}/2 = 60$ non-zero elements. We focus on the marginal $(\beta, \eta^2)$ chain of the Bayesian lasso Gibbs sampler described above. This marginal chain is reversible, and we seek to estimate the MCMC variance of the linear regression log-likelihood

\[
f(\beta, \eta^2, \tau) = -\frac{m}{2}\log(\eta^2) - \frac{1}{2\eta^2}\big\|\tilde Y - X\beta\big\|^2
\]
using the batch means variance estimator. To empirically assess the asymptotic behavior of this estimator, we obtain its frequentist sampling distribution as follows. We generate 5,000 replicates of the above Markov chain with independent random starting points (the initial $\beta$ is generated from a standard multivariate normal distribution and the initial $\eta^2$ is generated from an independent standard exponential distribution). The R script provided in the supplementary document of [28] is used for the Markov chain generations. On each replicate we run 120,000 iterations of the Markov chain, discard the initial 20,000 iterations as burn-in, and evaluate the log-likelihood at the remaining 100,000 iterations. The BM variance

estimator $\hat\sigma^2_{BM,f}$ is subsequently computed from the evaluated log-likelihood $f$ at the first $n$ = 5,000, 10,000, 50,000 and 100,000 iterations and for $b_n = \sqrt{n}$, $b_n = n^{0.4}$ and $b_n = n^{1/3+10^{-5}}$, and the resulting replicated estimates are then collected for each $(n, b_n)$ pair. Since the true MCMC variance $\sigma_f^2$ is of course unknown here, we focus on the asymptotic normality of only approximately standardized estimates over replications. More specifically, we first evaluate the mean (over replications) batch means estimate
\[
\bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n) = \frac{1}{5000}\sum_{l=1}^{5000}\hat\sigma^2_{BM,f}(n = 100{,}000, a_n, b_n)_{(l)},
\]
where for each $b_n$ (and hence $a_n$), $\hat\sigma^2_{BM,f}(n = 100{,}000, a_n, b_n)_{(l)}$ denotes the corresponding batch means variance estimate obtained from the $l$-th replicate with $n = 100{,}000$, $l = 1, \ldots, 5000$. The estimates $\bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)$ for the above three choices of $b_n$ are displayed in Table 2.

  b_n                                              √n        n^0.4     n^(1/3+10^-5)
  mean BM estimate (n = 100,000)                304.351     302.385       299.091
Table 2. The mean (over 5,000 replications) batch means estimate $\bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)$ of $\sigma_f^2$ obtained from replicated MCMC draws, each with iteration size $n$ = 100,000 and batch sizes $b_n = \sqrt{n}$, $n^{0.4}$ and $n^{1/3+10^{-5}}$.

After computing $\bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)$, we standardize all replicated batch means estimates with mean $= \bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)$ and standard deviation $= \bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)\sqrt{2/a_n}$, separately for each $(n, b_n)$ pair. The frequentist sampling distributions of these approximately standardized estimates are plotted as a matrix of histograms for various choices of $n$ and $b_n$, along with overlaid standard normal density curves, in Figure 2. From the figure, it follows that these sampling distributions of the approximately standardized estimates are very closely approximated by a standard normal distribution. Of course, unlike the histograms displayed in Figure 1 for the toy normal example (Section 3.1), no information on the bias of the estimates can be obtained here. However, these histograms do demonstrate the remarkable accuracy of an asymptotic normal approximation, and thus illustrate the applicability of the proposed CLT for the batch means MCMC variance estimate in a real world application.
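To make the computation concrete, the quantity tracked in this section can be obtained from stored Gibbs output in two steps: evaluate the log-likelihood $f$ at every retained draw and apply the batch means formula. The R sketch below is ours (not the script of [28]); beta_draws (a matrix with one row per retained iteration), eta2_draws, the centered response Ytilde and the standardized design matrix X are assumed to be available, and bm_var is the hypothetical helper from Section 1:

```r
# log-likelihood f(beta, eta^2) of the linear regression model at a single draw
loglik <- function(beta, eta2, Ytilde, X) {
  -length(Ytilde) / 2 * log(eta2) - sum((Ytilde - X %*% beta)^2) / (2 * eta2)
}

# fx[i] = f evaluated at the i-th retained draw; then the batch means estimate
fx <- sapply(seq_len(nrow(beta_draws)),
             function(i) loglik(beta_draws[i, ], eta2_draws[i], Ytilde, X))
bm_var(fx)   # batch means estimate of the MCMC variance of f
```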

References

[1] Alj, A., Azrak, R., and Mélard, G. (2014). On conditions in central limit theorems for martingale difference arrays. Economics Letters, 123(3):305–307. [2] Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological), 48(3):259–279. [3] Bratley, P., Fox, B. L., and Schrage, L. E. (2011). A guide to simulation. Springer Science & Business Media. [4] Chakraborty, S. and Khare, K. (2019). Consistent estimation of the spectrum of trace class data augmentation algorithms. Bernoulli, 25(4B):3832–3863.

[Figure 2 about here: a 4 x 3 matrix of histograms (rows: n = 5,000, 10,000, 50,000, 100,000; columns: $b_n = \sqrt{n}$, $b_n = n^{0.4}$, $b_n = n^{1/3+10^{-5}}$; y-axis: Density; x-axis: approximately standardized BM variance estimates), each overlaid with the standard normal density. See the caption below.]

Fig 2. Frequentist sampling distribution of the batch means MCMC variance estimator in the Bayesian lasso example. The sampling distributions of the approximately standardized (with mean $\bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)$ and standard deviation $\bar{\hat\sigma}^2_{BM,f}(n = 100{,}000, a_n, b_n)\sqrt{2/a_n}$, see Table 2) batch means MCMC variance estimator $\hat\sigma^2_{BM,f}(n, a_n, b_n)$ for the linear regression log-likelihood function $f$, evaluated at the iterations of the Bayesian lasso two-block Gibbs sampler, are plotted as a matrix of histograms for various choices of $n$ and $b_n$. For each $n \in \{$5,000, 10,000, 50,000, 100,000$\}$ (plotted along the vertical direction of the histogram matrix), the blue histogram (left most panel) corresponds to $b_n = \sqrt{n}$, red (middle panel) corresponds to $b_n = n^{0.4}$ and orange (right most panel) corresponds to $b_n = n^{1/3+10^{-5}}$. The overlaid black curve on each histogram corresponds to the standard normal density function.

[5] Chien, C., Goldsman, D., and Melamed, B. (1997). Large-sample results for batch means. Management Science, 43(9):1288–1295. [6] Chien, C.-H. (1988). Small-sample theory for steady state confidence intervals. In Proceedings of the 20th conference on Winter simulation, pages 408–413. ACM. [7] Damerdji, H. (1991). Strong consistency and other properties of the spectral variance estimator. Management Science, 37(11):1424–1440. [8] Diaconis, P., Khare, K., and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials. Statistical Science, 23(2):151–178. [9] Fishman, G. (2013). Monte Carlo: concepts, algorithms, and applications. Springer Science & Business Media. [10] Flegal, J. M., Haran, M., and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, pages 250–260. [11] Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. Ann. Statist., 38(2):1034–1070. [12] Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, pages 473–483. [13] Glynn, P. W. and Iglehart, D. L. (1990). Simulation output analysis using standardized time series.

Mathematics of Operations Research, 15(1):1–16. [14] Glynn, P. W. and Whitt, W. (1991). Estimating the asymptotic variance with batch means. Operations Research Letters, 10(8):431–435. [15] Hobert, J. P., Jones, G. L., Presnell, B., and Rosenthal, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89(4):731–743. [16] Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320. [17] Jones, G. L., Haran, M., Caffo, B. S., and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101(476):1537–1547. [18] Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998). Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician, 52(2):93–100. [19] Khare, K. and Hobert, J. P. (2013). Geometric ergodicity of the Bayesian lasso. Electronic Journal of Statistics, 7:2150–2163. [20] Li, X., Zhao, T., Wang, L., Yuan, X., and Liu, H. (2019a). flare: Family of Lasso Regression. R package version 1.6.0.2. [21] Li, X., Zhao, T., Wang, L., Yuan, X., and Liu, H. (2019b). flare: Family of Lasso Regression. R package version 1.6.0.2. [22] Meyn, S. and Tweedie, R. (2012). Markov Chains and Stochastic Stability. Communications and Control Engineering. Springer London. [23] Mykland, P., Tierney, L., and Yu, B. (1995). Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241. [24] Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2. [25] Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686. [26] Qin, Q., Hobert, J. P., and Khare, K. (2019). Estimating the spectral gap of a trace-class Markov operator. Electron. J. Statist., 13(1):1790–1822. [27] R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [28] Rajaratnam, B., Sparks, D., Khare, K., and Zhang, L. (2019). Uncertainty quantification for modern high-dimensional regression via scalable Bayesian methods. Journal of Computational and Graphical Statistics, 28(1):174–184. [29] Ripley, B. D. (2009). Stochastic simulation, volume 316. John Wiley & Sons. [30] Roberts, G. and Rosenthal, J. (1997). Geometric ergodicity and hybrid Markov chains. Electron.

Commun. Probab., 2:13–25. [31] Roberts, G. O. (1995). Markov chain concepts related to sampling algorithms. In Gilks, W. R., Richardson, S., and Spiegelhalter, D., editors, Markov chain Monte Carlo in practice, pages 45–57. Chapman and Hall/CRC, London. [32] Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surveys, 1:20–71. [33] Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L., et al. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434. [34] Song, W. T. and Schmeiser, B. W. (1995). Optimal mean-squared-error batch sizes. Management Science, 41(1):110–123. [35] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288. [36] Wickham, H. (2017). tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1.

Appendix A: Proofs of Results used in Lemma 2.1

Proposition A.1. Consider $\xi_{k,n}$ as defined in (2.4), and assume that the assumptions in Theorem 2.1 hold. Then
\[
\frac{1}{a_n - 1}\sum_{k=2}^{a_n} E\big(\xi_{k,n}^2 \mid \mathcal F_{k-1,n}\big) \xrightarrow{P} 1.
\]
Proof. Observe that, due to the Markov property of $(X_n)_{n\ge 1}$, $E(\xi_{k,n}^2 \mid \mathcal F_{k-1,n})$ is a function only of $X_{(k-1)b_n}$, for all $k = 2, \ldots, a_n$. Define $\check h(X_{(k-1)b_n}) = E(\xi_{k,n}^2 \mid \mathcal F_{k-1,n}) - 1$, with $\check h(X_{kb_n}) \in L_0^2(\pi)$ for all $k, n$, as $E_\pi(f^8) < \infty$ and $E_\pi(\xi_{k,n}^2) = 1$. It is enough to show that the mean squared convergence
\[
E_\pi\left[\frac{1}{a_n - 1}\sum_{k=1}^{a_n - 1}\check h\big(X_{kb_n}\big)\right]^2 \to 0
\]
holds. To this end, note that
\[
E_\pi\left[\frac{1}{a_n - 1}\sum_{k=1}^{a_n - 1}\check h\big(X_{kb_n}\big)\right]^2
= \frac{1}{(a_n - 1)^2}\sum_{k=1}^{a_n - 1} E_\pi\left[\check h\big(X_{kb_n}\big)^2\right]
+ \frac{2}{(a_n - 1)^2}\mathop{\sum\sum}_{1 \le k < k' \le a_n - 1} E_\pi\left[\check h\big(X_{kb_n}\big)\check h\big(X_{k'b_n}\big)\right]. \tag{A.1}
\]
The first term satisfies
\[
\frac{1}{(a_n - 1)^2}\sum_{k=1}^{a_n - 1} E_\pi\left[\check h\big(X_{kb_n}\big)^2\right] = \frac{1}{a_n - 1}\,\|\check h\|_\pi^2 \to 0
\]
as $n \to \infty$, and it remains to show that the second term in (A.1) also converges to zero. Note that
\begin{align*}
\left|\frac{2}{(a_n - 1)^2}\mathop{\sum\sum}_{1 \le k < k' \le a_n - 1} E_\pi\left[\check h\big(X_{kb_n}\big)\check h\big(X_{k'b_n}\big)\right]\right|
&= \left|\frac{2}{(a_n - 1)^2}\mathop{\sum\sum}_{1 \le k < k' \le a_n - 1}\big\langle\check h,\, K^{b_n(k'-k)}\check h\big\rangle_\pi\right|\\
&\overset{(\star\star)}{\le} \frac{2}{(a_n - 1)^2}\mathop{\sum\sum}_{1 \le k < k' \le a_n - 1}\|\check h\|_\pi^2\,\lambda^{b_n(k'-k)}\\
&= \frac{2\,\|\check h\|_\pi^2}{(a_n - 1)^2}\sum_{k=1}^{a_n - 1}\sum_{r=1}^{a_n - 1 - k}\lambda^{r b_n}
\le \frac{2\,\|\check h\|_\pi^2}{(a_n - 1)^2}\sum_{k=1}^{a_n - 1}\sum_{r=1}^{\infty}\lambda^{r b_n}\\
&= \frac{2\,\|\check h\|_\pi^2}{a_n - 1}\cdot\frac{\lambda^{b_n}}{1 - \lambda^{b_n}} \to 0
\end{align*}
as $n \to \infty$, where the equality in the first step follows from the Markov property and stationarity, and $(\star\star)$ follows from the Schwarz inequality combined with the operator norm inequality $\|K\check h\|_\pi \le \|K\|\,\|\check h\|_\pi$; as before we let $\lambda = \|K\|$, with $\lambda \in (0, 1)$ due to geometric ergodicity of $(X_n)_{n\ge 1}$. This completes the proof.

Proposition A.2. Under the setup assumed in Theorem 2.1, we have $E_\pi\big(b_n^2\,\bar Y_k^4\big) \to 3\sigma_f^4$ as $n \to \infty$, for each $k = 1, \ldots, a_n$.

Proof. At the outset, note that since $(X_n)_{n\ge 0}$ is stationary, $E_\pi\big(\bar Y_1^4\big) = E_\pi\big(\bar Y_k^4\big)$. Moreover, since $b_n \to \infty$ as $n \to \infty$, it is enough to show that, as $n \to \infty$,
\[
\frac{1}{n^2}\,E_\pi\big(Y_1 + Y_2 + \cdots + Y_n\big)^4 \to 3\sigma_f^4.
\]

For the remainder of the proof, we shall therefore replace bn by n. We will proceed by expanding Eπ(Y1 +

4 r r Y2 +···+Yn) and analyzing relevant terms separately. First, let us define µr = Eπ(Y1 ) = Eπ[f(Xi)−Eπf] 8 for r = 2, 4, 6. Note that Eπ(f ) < ∞ implies that µr < ∞ for all r = 2, 4, 6. Now observe that, 1 E (Y + Y + ··· + Y )4 n2 π 1 2 n  n  1 X X X X X = E Y 4 + 4 Y 3Y + 6 Y 2Y 2 + 12 Y 2Y Y + Y Y Y Y n2 π  i i j i j i j k i j k l i=1 i6=j i

= U1 + U2 + U3 + U4 + U5, say, and we shall consider the convergence of each Ui, i = 1,..., 5 separately. Since µ4 < ∞ and the chain is −1 Pn 4 stationary, it follows that Eπ(n i=1 Yi ) = µ4 for all n, so that n ! n ! 1 X 1 1 X U = E Y 4 = E Y 4 → 0. (A.2) 1 π n2 i n π n i i=1 i=1

As for U2, note that,

4 X 3  |U2| = Eπ Y Yj n2 i i6=j 4 X 3  ≤ Eπ Y Yj n2 i i6=j

8 X 3  = Eπ Y Yj n2 i i

8 X  3  = EπE f (Xi)f0(Xj) | Xi n2 0 i

8 X  3 j−i  = Eπ f (Xi)K f0(Xi) n2 0 i

8 X 3 j−i ≤ Eπ f (Xi)K f0(Xj) n2 0 i

U2 → 0 as n → ∞. (A.3)

Next we focus on U3. Since

 2 2  2 2  2 h 2 ˜ i 2 Eπ Yi Yj = Eπ Yi Yj − µ2 + µ2 = Eπ f0 (Xi)S(Xj) + µ2

˜ 2 2 where S(x) = f0 (x) − µ2 ∈ L0(π), therefore, 6 X 6 X h i 6 n(n − 1) U = E Y 2Y 2 = E f 2(X )S˜(X ) + µ2 3 n2 π i j n2 π 0 i j n2 2 2 i

6 X h i  1  = E f 2(X )S˜(X ) + 3 1 − µ2. n2 π 0 i j n 2 i

Now

6 X h 2 ˜ i 6 X h 2 ˜ i Eπ f (Xi)S(Xj) ≤ Eπ f (Xi)S(Xj) n2 0 n2 0 i

Here (?2) follows from Schwarz’s inequality. Consequently,

2 U3 → 3µ2 as n → ∞. (A.4)

Next we consider U4. Observe that   12 X X X U = E Y 2Y Y  + E Y 2Y Y  + E Y 2Y Y  4 n2  π i j k π i j k π i j k  i

(1) (2) = U4 + U4 , say.

2 ˜ 2 2 Here Yfi = S(Xi) = Yi − µ2 ∈ L (π). Note that   (1) 12 X X U = µ hf ,Kj−kf i + hf ,Kj−kf i 4 n2 2  0 0 π 0 0 π i

n−2 12 X = µ (n − r − 1)(n − r) hf ,Krf i n2 2 0 0 π r=1 n−2 X  r − 1  r  = 12 µ 1 − 1 − hf ,Krf i 2 n n 0 0 π r=1 ∞ ∞ X r X → 12 µ2 hf0,K f0iπ = 12 µ2 γr (A.5) r=1 r=1 as n → ∞, where γh’s are the auto-covariances as defined in (2.8), and the last convergence follows from the (2) dominated convergence theorem. As for U4 , observe that   (2) 12 X  2  X  2  X 2  U ≤ Eπ YfYjYk + Eπ YfYjYk + Eπ Y YjYk . (A.6) 4 n2  i i i  i

For i < j < k,

  (? ) h  i 2 3 2 Eπ Yfi YjYk = Eπ YjYkE Yfi | Xj,Xk (? ) h  i 4 2 = Eπ YjYkE Yfi | Xj

j−i ˜ 2 ≤ Eπ YjYkK Yj 1 (?5) 1   22 2  2 2 2 j−i ˜ ≤ Eπ Yj Yk Eπ K Yj r (?6) q 1 1  4  4 2 4 2 j−i ˜ ≤ Eπ Yj [Eπ (Yk )] λ Eπ Yj

j−i ≤ 4λ µ4. (A.7)

Here (?3) and (?4) are consequences of reversibility and Markov property respectively, and (?5) and (?6) are due to Schwarz’s inequality. Again for i < j < k,

  h i 2 2 Eπ Yfi YjYk = Eπ Yfi YjE (Yk | Xi,Xj) (? ) h i 7 2 = Eπ Yfi YjE (Yk | Xj)

2 k−j ≤ Eπ Yfi YjK Yj

(?8) k−j√ = 8λ µ2µ6 (A.8)

where (?7) is due to the Markov property, and (?8) follows from H¨older’sinequality. Therefore, from (A.7) and (A.8), we get

h i √ 2  j−i k−j Eπ Yfi YjYk ≤ min λ , λ (4µ4 + 8 µ2µ6) √ 2 max{j−i,k−j} √ = λ (4µ4 + 8 µ2µ6) √ k−i √ ≤ λ (4µ4 + 8 µ2µ6) Chakraborty, Bhattacharya and Khare/CLT for batch means variance estimate 24 where the last inequality is a consequence of the fact that for two real numbers a and b, a + b ≤ 2 max{a, b} and that λ = kKk ∈ (0, 1). Hence,

12 X   12 √ X √ k−i E Yf2Y Y ≤ (4µ + 8 µ µ ) λ n2 π i j k n2 4 2 6 i

By similar arguments, it can be shown that

12 X   12 X   E Yf2Y Y → 0, and E Yf2Y Y → 0 n2 π i j k n2 π i j k j

It follows from (A.5) and (A.9) that

∞ (1) (2) X U4 = U4 + U4 → 12 µ2 γh as n → ∞. (A.10) r=1

Finally, we focus on U5. Note that

24 X U = E (Y Y Y Y ) 5 n2 π i j k l i

(1) (2) = U5 + U5 , say.

Then,

(1) 24 X U = hf, Kj−ifi hf, Kl−kfi 5 n2 π π i

n b 2 −1c 24 X (n − 2r − 2)(n − 2r − 1) = hf, Krfi2 n2 2 π r=1 0 0 24 X [n − (r + r ) − 2] [n − (r + r ) − 1] 0 + hf, Krfi hf, Kr fi n2 2 π π 2≤r+r0≤n−2

 ∞  (?9) X r 2 X r r0 −−→ 12  hf, K fiπ + hf, K fiπhf, K fiπ r=1 r6=r0 Chakraborty, Bhattacharya and Khare/CLT for batch means variance estimate 25

∞ !2 ∞ !2 X r X = 3 2 hf, K fiπ = 3 2 γr (A.11) r=1 r=1

(2) where (?9) follows from the dominated convergence theorem. As for U5 , observe that

(2) 24 X U ≤ |E ([Y Y − E (Y Y )] Y Y )| . 5 n2 π i j π i j k l i

l−k  |Eπ ([YiYj − Eπ (YiYj)] YkYl)| = Eπ [YiYj − Eπ (YiYj)] Yk K f0(Xk) (? ) 1 1 10 h  2 2i 2 h l−k 2i 2 ≤ Eπ [YiYj − Eπ (YiYj)] Yk Eπ K f0(Xk)

√ l−k ≤ 8 µ2µ6 λ (A.12)

and due to reversibility,

√ j−i |Eπ ([YiYj − Eπ (YiYj)] YkYl)| = |Eπ (YiYj [YkYl − Eπ (YkYl)])| ≤ 8 µ2µ6 λ . (A.13)

Finally, we let

2 H(Xj) = E [(YiYj − Eπ (YiYj)) | Xj,Xk,Xl] = E [(YiYj − Eπ (YiYj)) | Xj] ∈ L0(π) with the equality being a consequence of the Markov property. Then, for i < j < k < l,

|Eπ ([YiYj − Eπ (YiYj)] YkYl)| = |Eπ (H(Xj)YkYl)|

= |EπE [H(Xj)YkYl | Xk,Xl]|

k−j ≤ Eπ K H(Xk)YkYl

1 (?11) h 2i 2 1 k−j   2 2 2 ≤ Eπ K H(Xk) Eπ Yk Yl

(?12) 1 1 k−j  2  2 2 ≤ λ Eπ H (Xk) µ4

k−j ≤ 4λ µ4. (A.14)

It follows from (A.12), (A.13) and (A.14) that

 l−k j−i k−j √ |Eπ ([YiYj − Eπ (YiYj)] YkYl)| ≤ min λ , λ , λ (4µ4 + 8 µ2µ6)

max{l−k,j−i,k−j} √ ≤ λ (4µ4 + 8 µ2µ6)

 1 3 max{l−k,j−i,k−j} √ = λ 3 (4µ4 + 8 µ2µ6)

 1 l−i √ ≤ λ 3 (4µ4 + 8 µ2µ6) .

Hence,

l−i (2) 24 X  1  √ U ≤ λ 3 (4µ + 8 µ µ ) 5 n2 4 2 6 i

n−1   r 24 √ X r − 1  1  = (4µ + 8 µ µ ) (n − r) λ 3 n2 4 2 6 2 r=3 ∞ r 24 √ X 2  1  ≤ (4µ + 8 µ µ ) r λ 3 n 4 2 6 r=1

24 √ 1  1   1 −3 = (4µ + 8 µ µ ) λ 3 1 + λ 3 1 − λ 3 → 0 as n → ∞. (A.15) n 4 2 6

Therefore, from (A.11) and (A.15), it follows that

∞ !2 X U5 → 3 2 γr as n → ∞. (A.16) r=1 Finally, combining (A.2), (A.3), (A.4), (A.10) and (A.16), we get

1 h i E (Y + Y + ··· + Y )4 = U + U + U + U + U n2 π 1 2 n 1 2 3 4 5 ∞ ∞ !2 2 X X → 3µ2 + 12µ2 γr + 3 2 γr r=1 r=1 ∞ !2 X 4 = 3 µ2 + 2 γr = 3σf as n → ∞. r=1 This completes the proof.

Proposition A.3. Under the setup assumed in Theorem 2.1, and if in addition the Markov chain is stationary, then $E_\pi\big(b_n^2\,\bar Y_1^2\,\bar Y_2^2\big) \to \sigma_f^4$ as $n \to \infty$.

Proof. We have      bn 2bn 2bn  2 2 2 1 X X 2 2 X X 2 0 Eπ bnY 1Y 2 = 2 Eπ  Yi Yj  + Eπ  YiYi Yj  bn 0 i=1 j=bn+1 i6=i j=bn+1     bn X X 2 X X 2 + Eπ  Yi YjYj0  + Eπ  YiYi0 YjYj0  i=1 j6=j0 i6=i0 j6=j0   bn 2bn 2 1 X X 2  2 2 = µ2 + 2 Eπ  Yj Yi − Eπ Yi  bn i=1 j=bn+1   2bn 1 X i0−i 1 X X  2 2 0 + 2 2bnµ2 hf0,K f0iπ + 2 Eπ  YiYi Yj − Eπ Yj  bn 0 bn 0 i

2 1 X i0−i 1 X j0−j = µ2 + T1 + 2µ2 hf0,K f0iπ + T2 + 2µ2 hf0,K f0iπ bn bn i

By analysis similar to the proof of Proposition A.2, it follows that for each i = 1, 2, 3, 4, Ti → 0 as n → ∞. Therefore, by the dominated convergence theorem, as n → ∞,

∞ ∞ !2 ∞ !2  2 2 2 2 X r X r X r 4 Eπ bnY1 Y2 → µ2 + 4µ2 hf0,K f0iπ + 2 hf0,K f0iπ = µ2 + hf0,K f0iπ = σf . r=1 r=1 r=1 This completes the proof.

Proposition A.4. Consider the quantity $U_n$ as defined in (2.5). We have $E_\pi(U_n^2) \to 2\sigma_f^4$ as $n \to \infty$.

Proof. We have
\begin{align*}
E_\pi\big(U_n^2\big) &= E_\pi\left[b_n\bar Y_1^2 - b_n h(X_0) + b_n\tilde h\big(X_{b_n}\big)\right]^2\\
&= E_\pi\big(b_n^2\bar Y_1^4\big) + E_\pi\big(b_n^2 h^2(X_0)\big) + E_\pi\big(b_n^2\tilde h^2(X_{b_n})\big) - 2E_\pi\big(b_n^2\bar Y_1^2\, h(X_0)\big)\\
&\qquad - 2E_\pi\big(b_n^2\, h(X_0)\tilde h(X_{b_n})\big) + 2E_\pi\big(b_n^2\bar Y_1^2\,\tilde h(X_{b_n})\big)\\
&= E_\pi\big(b_n^2\bar Y_1^4\big) + E_\pi\big(b_n^2 h^2(X_0)\big) + \left[E_\pi\big(b_n^2 h^2(X_0)\big) - \big\{E_\pi\big(b_n\bar Y_1^2\big)\big\}^2\right] - 2E_\pi\big(b_n^2 h^2(X_0)\big)\\
&\qquad - 2b_n^2\big\langle\tilde h, K^{b_n}\tilde h\big\rangle_\pi + 2b_n^2 E_\pi\big(\bar Y_1^2\,\bar Y_2^2\big) - 2b_n^2 E_\pi\big(\bar Y_1^2\big)E_\pi\big(\bar Y_2^2\big)\\
&= E_\pi\big(b_n^2\bar Y_1^4\big) - 3\left[E_\pi\big(b_n\bar Y_1^2\big)\right]^2 + 2b_n^2 E_\pi\big(\bar Y_1^2\,\bar Y_2^2\big) - 2b_n^2\big\langle\tilde h, K^{b_n}\tilde h\big\rangle_\pi.
\end{align*}
Of course, $E_\pi\big(b_n\bar Y_1^2\big) \to \sigma_f^2$, and from Propositions A.2 and A.3 it follows that, as $n \to \infty$, $E_\pi\big(b_n^2\bar Y_1^4\big) \to 3\sigma_f^4$ and $b_n^2 E_\pi\big(\bar Y_1^2\bar Y_2^2\big) \to \sigma_f^4$ respectively. Finally,
\[
b_n^2\big\langle\tilde h, K^{b_n}\tilde h\big\rangle_\pi \le b_n^2\,\lambda^{b_n}\,\big\|\tilde h\big\|_\pi^2 \le \lambda^{b_n}\, E_\pi\big(b_n^2\bar Y_1^4\big) \to 0
\]
as $n \to \infty$. Consequently,
\[
E_\pi\big(U_n^2\big) \to 2\sigma_f^4 \quad\text{as } n \to \infty.
\]
This completes the proof.

Proposition A.5. Consider the quantity $V_n$ as defined in (2.6). We have $E_\pi(V_n^2) \to 0$ as $n \to \infty$.

Proof. We have
\begin{align*}
E_\pi\big(V_n^2\big) &= E_\pi\left[b_n\tilde g\big(X_{b_n}\big) - b_n\tilde h\big(X_{b_n}\big) - E\big(b_n\tilde g(X_{b_n}) \mid X_0\big)\right]^2\\
&= E_\pi\left[b_n\left\{\big(I - K^{b_n}\big)^{-1} - I\right\}\tilde h\big(X_{b_n}\big) - b_n\big(K^{b_n}\tilde g\big)(X_0)\right]^2\\
&\le 2E_\pi\left[b_n\left\{\big(I - K^{b_n}\big)^{-1} - I\right\}\tilde h\big(X_{b_n}\big)\right]^2 + 2E_\pi\left[b_n\big(K^{b_n}\tilde g\big)(X_0)\right]^2\\
&\le 2\left\|\big(I - K^{b_n}\big)^{-1} - I\right\|^2\big\|b_n\tilde h\big\|_\pi^2 + 2\big\|K^{b_n}\big\|^2\big\|b_n\tilde g\big\|_\pi^2\\
&\le 2\left\|\big(I - K^{b_n}\big)^{-1} - I\right\|^2\big\|b_n\tilde h\big\|_\pi^2 + 2\big\|K^{b_n}\big\|^2\big\|\big(I - K^{b_n}\big)^{-1}\big\|^2\big\|b_n\tilde h\big\|_\pi^2\\
&\le \frac{2\lambda^{2b_n}}{(1 - \lambda^{b_n})^2}\,E_\pi\big(b_n^2\bar Y_1^4\big) + \frac{2\lambda^{2b_n}}{(1 - \lambda^{b_n})^2}\,E_\pi\big(b_n^2\bar Y_1^4\big)
= \frac{4\lambda^{2b_n}}{(1 - \lambda^{b_n})^2}\,E_\pi\big(b_n^2\bar Y_1^4\big),
\end{align*}
where $\lambda = \|K\| \in (0, 1)$. From Proposition A.2 it follows that $E_\pi\big(b_n^2\bar Y_1^4\big) \to 3\sigma_f^4$. Hence, $E_\pi(V_n^2) \to 0$ as $n \to \infty$. This completes the proof.