Estimating accuracy of the MCMC variance estimator:
a central limit theorem for batch means estimators
Saptarshi Chakraborty∗, Suman K. Bhattacharya† and Kshitij Khare†
∗Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, 485 Lexington Ave, New York, NY 10017, USA. e-mail: [email protected]
†Department of Statistics, University of Florida, 101 Griffin-Floyd Hall, Gainesville, Florida 32601, USA. e-mail: [email protected]; [email protected]

Abstract: The batch means estimator of the MCMC variance is a simple and effective measure of accuracy for MCMC based ergodic averages. Under various regularity conditions, the estimator has been shown to be consistent for the true variance. However, the estimator can be unstable in practice, as it depends directly on the raw MCMC output. A measure of accuracy of the batch means estimator itself, ideally in the form of a confidence interval, is therefore desirable. The asymptotic variance of the batch means estimator is known; however, without any knowledge of the asymptotic distribution, asymptotic variances are in general insufficient to describe variability. In this article we prove a central limit theorem for the batch means estimator that allows for the construction of asymptotically accurate confidence intervals for the batch means estimator. Additionally, our results provide a Markov chain analogue of the classical CLT for the sample variance parameter for i.i.d. observations. Our result assumes standard regularity conditions similar to the ones assumed in the literature for proving consistency. Simulated and real data examples are included as illustrations and applications of the CLT.

MSC 2010 subject classifications: Primary 60J22; secondary 62F15.
Keywords and phrases: MCMC variance, batch means estimator, asymptotic normality.
1. Introduction
Markov chain Monte Carlo (MCMC) techniques are indispensable tools of modern-day computation. Routinely used in Bayesian analysis and machine learning, a major application of MCMC lies in the approximation of intractable and often high-dimensional integrals. To elaborate, let $(\mathcal{X}, \mathcal{F}, \nu)$ be an arbitrary measure space and let $\Pi$ be a probability measure on $\mathcal{X}$, with associated density $\pi(\cdot)$ with respect to $\nu$. The quantity
of interest is the integral
$$\pi f = E_\pi f := \int_{\mathcal{X}} f(x) \, d\Pi(x) = \int_{\mathcal{X}} f(x) \, \pi(x) \, \nu(dx),$$
where $f$ is a real-valued, $\Pi$-integrable function on $\mathcal{X}$. In many modern applications, such an integral is often intractable, i.e., (a) it does not have a closed form, (b) deterministic approximations are inefficient, often due to the high dimensionality of $\mathcal{X}$, and (c) it cannot be estimated via classical or i.i.d. Monte Carlo techniques, as i.i.d. random generation from $\Pi$ is in general infeasible. Markov chain Monte Carlo (MCMC) techniques are the go-to method of approximation for such integrals. Here, a Markov chain $(X_n)_{n \ge 1}$ with an invariant probability distribution $\Pi$ [see, e.g., 22, for definitions] is generated using some MCMC sampling
technique such as the Gibbs sampler or the Metropolis-Hastings algorithm. Then, ergodic averages $\bar{f}_n := n^{-1} \sum_{i=1}^{n} f(X_i)$ based on realizations of the Markov chain $(X_n)_{n \ge 1}$ are used as approximations of $E_\pi f$. Measuring the errors incurred in such approximations is a critical step in any numerical analysis. It is well known that when a Markov chain is Harris ergodic (i.e., aperiodic, $\varphi$-irreducible, and Harris recurrent [see 22, for definitions]), ergodic averages based on realizations of the Markov chain always furnish strongly consistent estimates of the corresponding population quantities [22, Theorem 13.0.1]. In other words, if a Harris ergodic chain is run long enough, then the estimate $\bar{f}_n$ is always guaranteed to provide a reasonable approximation to the otherwise intractable quantity $E_\pi f$ (under some mild regularity conditions on $f$). Determining an MCMC sample (or iteration) size $n$ that justifies this convergence, however, requires a measure of accuracy. As in i.i.d. Monte Carlo estimation, the standard error of $\bar{f}_n$ obtained from the MCMC central limit theorem (MCMC CLT) is the natural quantity to use for this purpose. The MCMC CLT
requires additional regularity conditions compared to its i.i.d. counterpart; if the Markov chain $(X_n)_{n \ge 1}$ is geometrically ergodic (see, e.g., Meyn and Tweedie [22] for definitions), and if $E_\pi |f|^{2+\delta} < \infty$ for some $\delta > 0$ (or $E_\pi f^2 < \infty$ if $(X_n)_{n \ge 1}$ is geometrically ergodic and reversible), it can be shown that as $n \to \infty$
$$\sqrt{n} \left( \bar{f}_n - E_\pi f \right) \xrightarrow{d} N(0, \sigma_f^2),$$
where $\sigma_f^2$ is the MCMC variance, defined as
$$\sigma_f^2 = \mathrm{var}_\pi f(X_1) + 2 \sum_{i=2}^{\infty} \mathrm{cov}_\pi \left( f(X_1), f(X_i) \right). \tag{1.1}$$
Here $\mathrm{var}_\pi$ and $\mathrm{cov}_\pi$ respectively denote the variance and (auto-)covariance computed under the stationary distribution $\Pi$. Note that other sufficient conditions ensuring the above central limit theorem also exist; see the survey articles of Jones et al. [16] and Roberts and Rosenthal [32] for more details. When the
regularity conditions hold, a natural measure of accuracy for $\bar{f}_n$ is therefore given by the MCMC standard error (MCMCSE), defined as $\sigma_f / \sqrt{n}$. Note that this formula for the MCMCSE, alongside measuring the error in approximation, also helps determine an optimal iteration size $n$ required to achieve a pre-specified
level of precision, thus providing a stopping rule for terminating MCMC sampling. A related use of $\sigma_f^2$ also lies in the computation of the effective sample size $\mathrm{ESS} = n \, \mathrm{var}_\pi f(X_1) / \sigma_f^2$ [18, 29]. The ESS measures how $n$ dependent MCMC samples compare to $n$ i.i.d. observations from $\Pi$, thus providing a univariate measure of the quality of the MCMC samples. Thus, to summarize, the MCMC variance $\sigma_f^2$ facilitates the computation/determination of three crucial aspects of an MCMC implementation, namely (a) a stopping rule for terminating simulation,
(b) the effective sample size (ESS) of the MCMC draws, and (c) the precision of the MCMC estimate $\bar{f}_n$.
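To make these quantities concrete, the following minimal Python sketch (our own illustration, not from the paper; the function name and the use of NumPy are our choices) computes the MCMCSE and the ESS from MCMC output, given an estimate of $\sigma_f^2$ such as the batch means estimate introduced below:

```python
import numpy as np

def mcmc_se_and_ess(samples, sigma2_f):
    """Given draws f(X_1), ..., f(X_n) and an estimate of the MCMC
    variance sigma_f^2, return the MCMC standard error sigma_f / sqrt(n)
    and the effective sample size ESS = n * var_pi(f(X_1)) / sigma_f^2,
    with var_pi estimated by the sample variance."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mcmc_se = np.sqrt(sigma2_f / n)
    ess = n * np.var(samples, ddof=1) / sigma2_f
    return mcmc_se, ess
```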
In most non-trivial applications, however, the MCMC variance $\sigma_f^2$ is usually unknown and must be estimated. A substantial literature has been devoted to the estimation of $\sigma_f^2$ [see, e.g., 3, 9, 12, 13, 14, 23, 31, 10, 11, to name a few], and several methods, such as regenerative sampling, spectral variance estimation, and overlapping and non-overlapping batch means estimation, have been developed. In this paper, we focus on the non-overlapping batch means estimator, henceforth called the batch means estimator for simplicity, where estimation of $\sigma_f^2$ is performed by breaking the $n = a_n b_n$ Markov chain iterations into $a_n$ non-overlapping blocks or batches of equal size $b_n$. Then, for each $k \in \{1, 2, \ldots, a_n\}$, one calculates the $k$-th batch mean $\bar{Z}_k := \frac{1}{b_n} \sum_{i=1}^{b_n} Z_{(k-1)b_n + i}$ and the overall mean $\bar{Z} := \frac{1}{a_n} \sum_{k=1}^{a_n} \bar{Z}_k$, where $Z_i = f(X_i)$ for $i = 1, 2, \ldots$, and finally estimates $\sigma_f^2$ by
$$\hat{\sigma}^2_{BM,f} = \hat{\sigma}^2_{BM,f}(n, a_n, b_n) = \frac{b_n}{a_n - 1} \sum_{k=1}^{a_n} \left( \bar{Z}_k - \bar{Z} \right)^2. \tag{1.2}$$
The batch means estimator is straightforward to implement, and can be computed post hoc without making any changes to the original MCMC algorithm, as opposed to some other methods such as regenerative sampling. Under various sets of regularity conditions, the batch means estimator $\hat{\sigma}^2_{BM,f}$ has been shown to be strongly consistent [7, 15, 17, 11] and also mean-squared consistent [5, 11] for $\sigma_f^2$, provided the batch
size $b_n$ and the number of batches $a_n$ both increase with $n$. Note that the estimator depends on the choice of the batch size $b_n$ (and hence the number of batches $a_n = n / b_n$). Optimal selection of the batch size is still an open problem, and both $b_n = n^{1/2}$ and $b_n = n^{1/3}$ have been deemed desirable in the literature; the former ensures that the batch means $\{\bar{Z}_k\}$ approach asymptotic normality at the fastest rate (under certain regularity conditions, [6]), and the latter minimizes the asymptotic mean-squared error of $\hat{\sigma}^2_{BM,f}$ (under different regularity conditions, [34]).

It is, however, important to recognize that consistency alone does not in general justify practical usefulness, and a measure of accuracy is always required to assess the validity of an estimator.
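For concreteness, here is a minimal NumPy sketch of the estimator in (1.2) (our own illustration; the function name and the decision to discard draws that do not fill a complete batch are ours):

```python
import numpy as np

def batch_means_estimator(samples, b_n):
    """Non-overlapping batch means estimate of sigma_f^2 as in (1.2),
    where `samples` holds f(X_1), ..., f(X_n)."""
    samples = np.asarray(samples, dtype=float)
    a_n = len(samples) // b_n                    # number of batches
    z = samples[: a_n * b_n].reshape(a_n, b_n)   # one row per batch
    batch_means = z.mean(axis=1)                 # Z-bar_k, k = 1, ..., a_n
    grand_mean = batch_means.mean()              # Z-bar
    return b_n * np.sum((batch_means - grand_mean) ** 2) / (a_n - 1)
```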
It is known that the asymptotic variance of the batch means estimator is given by $\mathrm{var} \, \hat{\sigma}^2_{BM,f} = 2\sigma_f^4 / a_n + o(1/n)$ under various regularity conditions [5, 11]. However, without any knowledge of the asymptotic distribution, the asymptotic variance alone is generally insufficient for assessing the accuracy of an estimator. For example, a $\pm 2$ standard error bound does not in general guarantee more than 75% coverage, as obtained from the Chebyshev inequality, and to ensure a pre-specified (95%) coverage, a much larger interval (approximately $\pm 4.5$ standard errors) is necessary in general. This provides a strong practical motivation for determining the asymptotic distribution of the batch means estimator. To the best of our knowledge, however, no such result is available. The main purpose of this paper is to establish a central limit theorem that guarantees asymptotic normality of the batch means estimator under mild and standard regularity conditions (Theorem 2.1).

There are two major motivations for our work. As discussed above, the first motivation lies in the immediate practical implication of this work. As a consequence of the CLT, the use of approximate normal confidence intervals for measuring the accuracy of batch means estimators is justified. Given MCMC samples, such intervals can be computed alongside the batch means estimator at virtually no additional cost, and therefore could be of great practical relevance. The second major motivation comes from a theoretical point of view. Although a central limit theorem for the sample variance of an i.i.d. Monte Carlo estimate is known (and can be easily established via the delta method, for example), no Markov chain Monte Carlo analogue of this result is available. Our paper provides an answer to this yet-to-be-addressed theoretical question. The proof is quite involved and leverages operator theory and the martingale central limit theorem [see, e.g., 1], as opposed to the Brownian motion based approach adopted in [11], and the result is analogous to the classical CLT for the sample variance in the i.i.d. Monte Carlo case.

The remainder of this article is organized as follows. In Section 2 we state and prove the main central limit theorem along with a few intermediate results. Section 3 provides two illustrations of the CLT: one based on a toy example (Section 3.1) and one based on a real world example (Section 3.2). Proofs of some key propositions and intermediate results are provided in the Appendix.
2. A Central Limit Theorem for the Batch Means Estimator
This section provides our main result, namely, a central limit theorem for the non-overlapping batch means standard error estimator. Before stating the theorem, we fix our notation and review some known results on Markov chains. Let $(X_n)_{n \ge 1}$ be a Markov chain on $(\mathcal{X}, \mathcal{F}, \nu)$ with Markov transition density $k(\cdot, \cdot)$ and stationary measure $\Pi$ (with density $\pi$). We denote by $K(\cdot, \cdot)$ the Markov transition function of $(X_n)_{n \ge 1}$; in particular, for $x \in \mathcal{X}$ and a Borel set $A \subseteq \mathcal{X}$, $K(x, A) = \int_A k(x, x') \, dx'$. For $m \ge 1$, the associated $m$-step Markov transition function is defined in the following inductive fashion:
$$K^m(x, A) = \int_{\mathcal{X}} K^{m-1}(x', A) \, K(x, dx') = \Pr(X_{m+j} \in A \mid X_j = x)$$
for any $j = 0, 1, \ldots$, with $K^1 \equiv K$. The Markov chain $(X_n)_{n \ge 1}$ is said to be reversible if, for any $x, x' \in \mathcal{X}$, the detailed balance condition
$$\pi(x) \, K(x, dx') = \pi(x') \, K(x', dx)$$
is satisfied. Also, the chain $(X_n)_{n \ge 1}$ is said to be geometrically ergodic if there exist a constant $\kappa \in [0, 1)$ and a function $Q : \mathcal{X} \to [0, \infty)$ such that for any $x \in \mathcal{X}$ and any $m \in \{1, 2, \ldots\}$
$$\left\| K^m(x, \cdot) - \Pi(\cdot) \right\| := \sup_{A \subseteq \mathcal{X}} \left| K^m(x, A) - \Pi(A) \right| \le Q(x) \, \kappa^m.$$
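To fix ideas, a standard illustration (our own, not taken from this paper): the Gaussian autoregressive chain $X_{n+1} = \rho X_n + \epsilon_{n+1}$ with i.i.d. $\epsilon_n \sim N(0, 1)$ and $|\rho| < 1$ has invariant distribution $\Pi = N(0, 1/(1-\rho^2))$, and since $K^m(x, \cdot) = N\left(\rho^m x, (1-\rho^{2m})/(1-\rho^2)\right)$, one can verify a bound of the form
$$\left\| K^m(x, \cdot) - \Pi(\cdot) \right\| \le C \, (1 + |x|) \, |\rho|^m$$
for some constant $C > 0$, so that the definition above holds with $\kappa = |\rho|$ and $Q(x) = C(1 + |x|)$.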
Let us denote by
$$L_0^2(\pi) = \left\{ g : \mathcal{X} \to \mathbb{R} \; : \; E_\pi g = \int_{\mathcal{X}} g(x) \, d\Pi(x) = 0 \ \text{ and } \ E_\pi g^2 = \int_{\mathcal{X}} g(x)^2 \, d\Pi(x) < \infty \right\}.$$
This is a Hilbert space, where the inner product of $g, h \in L_0^2(\pi)$ is defined as
$$\langle g, h \rangle_\pi = \int_{\mathcal{X}} g(x) \, h(x) \, d\Pi(x) = \int_{\mathcal{X}} g(x) \, h(x) \, \pi(x) \, d\nu(x)$$
and the corresponding norm is defined by $\|g\|_\pi = \sqrt{\langle g, g \rangle_\pi}$. The Markov transition function $K(\cdot, \cdot)$ determines a Markov operator; we shall slightly abuse our notation and denote the associated operator by $K$ as well.
More specifically, we shall let $K : L_0^2(\pi) \to L_0^2(\pi)$ denote the operator that maps $g \in L_0^2(\pi)$ to $(Kg)(x) = \int_{\mathcal{X}} g(x') \, K(x, dx')$. The operator norm of $K$ is defined as $\|K\| = \sup_{g \in L_0^2(\pi) : \|g\|_\pi = 1} \|Kg\|_\pi$. It follows that $\|K\| \le 1$. Roberts and Rosenthal [30] show that for reversible (self-adjoint) $K$, $\|K\| < 1$ if and only if the associated Markov chain $(X_n)_{n \ge 1}$ is geometrically ergodic. The following theorem establishes a CLT for the batch means estimator of the MCMC variance.
Theorem 2.1. Suppose $(X_n)_{n \ge 1}$ is a stationary, geometrically ergodic, reversible Markov chain with state space $\mathcal{X}$ and invariant distribution $\Pi$. Let $f : \mathcal{X} \to \mathbb{R}$ be a Borel function with $E_\pi(f^8) < \infty$ and $\sigma_f^2 > 0$. Consider the batch means estimator $\hat{\sigma}^2_{BM,f} = \hat{\sigma}^2_{BM,f}(n, a_n, b_n)$ of the MCMC variance $\sigma_f^2$ as defined in (1.2). Let $a_n$ and $b_n$ be such that $a_n \to \infty$, $b_n \to \infty$ and $\sqrt{a_n} / b_n \to 0$ as $n \to \infty$. Then
$$\sqrt{a_n} \left( \hat{\sigma}^2_{BM,f}(n, a_n, b_n) - \sigma_f^2 \right) \xrightarrow{d} N\left( 0, 2\sigma_f^4 \right),$$
where $\sigma_f^2$ is the MCMC variance as defined in (1.1).
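As an immediate practical consequence (a sketch of our own, not given in the paper), an asymptotic $100(1-\alpha)\%$ confidence interval for $\sigma_f^2$ follows by plugging $\hat{\sigma}^2_{BM,f}$ into the limiting variance $2\sigma_f^4$, giving $\hat{\sigma}^2_{BM,f} \pm z_{\alpha/2} \sqrt{2/a_n} \, \hat{\sigma}^2_{BM,f}$:

```python
import numpy as np
from scipy import stats

def batch_means_ci(sigma2_hat, a_n, alpha=0.05):
    """Asymptotic (1 - alpha) confidence interval for sigma_f^2 based on
    sqrt(a_n) * (sigma2_hat - sigma_f^2) -> N(0, 2 * sigma_f^4), with
    sigma_f^4 estimated by sigma2_hat ** 2 (plug-in)."""
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(2.0 / a_n) * sigma2_hat
    return sigma2_hat - half_width, sigma2_hat + half_width
```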
Remark 2.1 (Proof technique). Our proof is based on an operator theoretic approach, and relies on a careful manipulation of appropriate moments and the martingale CLT. Previous work in [5, 7, 11] on consistency of $\hat{\sigma}^2_{BM,f}$ is based on a Brownian motion based approximation (see [11, Equation ??]). This leads to some differences in the assumptions that are required to prove the respective results. Note again that [5, 7, 11] do not explore a CLT for the batch means estimator.
Remark 2.2 (Discussion of assumptions: uniform vs. geometric ergodicity, reversibility, and moments). Our results require geometric ergodicity of the Markov chain, which in general is required to guarantee a CLT for the MCMC estimate $\bar{f}_n$ itself. The consistency of $\hat{\sigma}^2_{BM,f}$ in [5] and [7] has been proved under uniform ergodicity of the Markov chain, which is substantially more restrictive and difficult to justify in practice. On the other hand, [11] consider a Brownian motion based approach to prove their result. The consistency result in [11] holds under geometric ergodicity; however, verifying a crucial Brownian motion based sufficient condition can be challenging when the chain is not uniformly ergodic. On the other hand, we require reversibility of the Markov chain, which is not a requirement in [5, 7, 11]. Note that the commonly used Metropolis-Hastings algorithm and its modern efficient extension, the Hamiltonian Monte Carlo algorithm, are necessarily reversible [12, 24]. Also, for any Gibbs sampler, a reversible counterpart can always be constructed through random scans or reversible fixed scans [2, 12], and a two-block Gibbs sampler is always reversible. We require the function $f$ to have a finite eighth moment, while the consistency results in [7] assume the existence of a twelfth moment and those in [11] assume moments of order $4 + \delta + \epsilon$ for some $\delta > 0$ and $\epsilon > 0$. Note again that the authors in [11] do not establish a CLT.
Remark 2.3 (Stationarity). It is to be noted that Theorem 2.1 assumes stationarity, i.e., the initial measure of the Markov chain is assumed to be the stationary measure. This is similar to the assumptions made in [7, 6] for establishing consistency. A moderate burn-in or warm-up period for an MCMC algorithm is usually enough to guarantee approximate stationarity in practice.
Remark 2.4 (Choice of $a_n$ and $b_n$). Consider the two practically recommended choices [10], (i) $a_n = b_n = \sqrt{n}$ and (ii) $b_n = n^{1/3}$ (so that $a_n = n^{2/3}$), as mentioned in the Introduction. Clearly, (i) satisfies the sufficient conditions on $a_n$ and $b_n$ described in Theorem 2.1, and hence batch means estimators based on this choice attain a CLT, provided the other conditions in Theorem 2.1 hold. On the other hand, (ii) does not satisfy the conditions in Theorem 2.1, and hence a CLT is not guaranteed with this choice. Small adjustments, such as $a_n = n^{2/3 - \delta}$, $b_n = n^{1/3 + \delta}$ for some small $0 < \delta < 2/3$, or $a_n = n^{2/3} (\log n)^{-\delta}$ and $b_n = n^{1/3} (\log n)^{\delta}$ for some (small) $\delta > 0$, could be used to technically satisfy the sufficient condition; however, the resulting convergence in distribution may be slow (see the toy example in Section 3.1).
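The following sketch (ours; the default $\delta$ is arbitrary) implements the second log-adjusted choice and indicates why it satisfies the theorem: with $b_n = n^{1/3} (\log n)^{\delta}$ and $a_n \approx n / b_n$, one has $\sqrt{a_n} / b_n \approx (\log n)^{-3\delta/2} \to 0$:

```python
import numpy as np

def adjusted_batch_sizes(n, delta=0.1):
    """Log-adjusted batch size b_n = n^(1/3) * (log n)^delta from
    Remark 2.4, with a_n taken as n // b_n so that a_n * b_n <= n."""
    b_n = max(1, round(n ** (1 / 3) * np.log(n) ** delta))
    a_n = n // b_n
    return a_n, b_n

# For n = 10^6: sqrt(a_n) / b_n is already noticeably below 1,
# and it decays (slowly) like (log n)^(-3 * delta / 2).
```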
Before proving Theorem 2.1, we first introduce some notation, and then state and prove some intermediate
results. Suppose the Markov chain $(X_n)_{n \ge 1}$ and the function $f$ satisfy the assumptions made in Theorem 2.1. Define $Y_i = f(X_i) - E_\pi f$ for $i = 1, 2, \ldots$, and write the batch means estimator $\hat{\sigma}^2_{BM,f}$ in (1.2) as
$$\hat{\sigma}^2_{BM,f} = \hat{\sigma}^2_{BM,f}(n, a_n, b_n) = \frac{b_n}{a_n - 1} \sum_{k=1}^{a_n} \left( \bar{Y}_k - \bar{Y} \right)^2 = \frac{a_n}{a_n - 1} \left( \frac{b_n}{a_n} \sum_{k=1}^{a_n} \bar{Y}_k^2 - b_n \bar{Y}^2 \right).$$
Here $\bar{Y}_k := b_n^{-1} \sum_{i=1}^{b_n} Y_{(k-1)b_n + i}$ and $\bar{Y} := a_n^{-1} \sum_{k=1}^{a_n} \bar{Y}_k$. We shall consider the related quantity
$$\tilde{\sigma}^2_{BM,f} := \frac{b_n}{a_n} \sum_{k=1}^{a_n} \bar{Y}_k^2 - b_n \bar{Y}^2 = \frac{a_n - 1}{a_n} \, \hat{\sigma}^2_{BM,f} \tag{2.1}$$
and call it the modified batch means estimator. The following two lemmas establish two asymptotic results on the modified batch means estimator. The first lemma proves asymptotic normality for the modified batch means estimator (with a shift) whenever $a_n \to \infty$ and $b_n \to \infty$. Key propositions needed in the proof of this lemma are provided in the Appendix.
Lemma 2.1. Consider the modified batch means estimator $\tilde{\sigma}^2_{BM,f}$ as defined in (2.1). If $a_n \to \infty$ and $b_n \to \infty$ as $n \to \infty$, then
$$\sqrt{a_n} \left[ \tilde{\sigma}^2_{BM,f} - \sigma_f^2 - \left\{ E_\pi\left( \tilde{\sigma}^2_{BM,f} \right) + E_\pi\left( b_n \bar{Y}^2 \right) - \sigma_f^2 \right\} \right] \xrightarrow{d} N(0, 2\sigma_f^4),$$
where $\sigma_f^2$ is the MCMC variance as defined in (1.1).
Proof. First observe that
$$\begin{aligned}
&\sqrt{a_n} \left[ \tilde{\sigma}^2_{BM,f} - \sigma_f^2 - \left\{ E_\pi\left( \tilde{\sigma}^2_{BM,f} \right) + E_\pi\left( b_n \bar{Y}^2 \right) - \sigma_f^2 \right\} \right] \\
&= \sqrt{a_n} \left( \tilde{\sigma}^2_{BM,f} - E_\pi\left( \tilde{\sigma}^2_{BM,f} \right) - E_\pi\left( b_n \bar{Y}^2 \right) \right) \\
&= \sqrt{a_n} \left( \frac{b_n}{a_n} \sum_{k=1}^{a_n} \bar{Y}_k^2 - b_n \bar{Y}^2 - E_\pi\left( \frac{b_n}{a_n} \sum_{k=1}^{a_n} \bar{Y}_k^2 \right) \right) \\
&= \frac{b_n}{\sqrt{a_n}} \sum_{k=1}^{a_n} \left( \bar{Y}_k^2 - E_\pi \bar{Y}_k^2 \right) - \sqrt{a_n} \, b_n \bar{Y}^2 \\
&= \frac{b_n}{\sqrt{a_n}} \sum_{k=2}^{a_n} \left\{ \bar{Y}_k^2 - E\left( \bar{Y}_k^2 \mid \mathcal{F}_{k-1,n} \right) + E\left( \bar{Y}_k^2 \mid \mathcal{F}_{k-1,n} \right) - E_\pi \bar{Y}_k^2 \right\} + \frac{b_n}{\sqrt{a_n}} \left( \bar{Y}_1^2 - E_\pi \bar{Y}_1^2 \right) - \sqrt{a_n} \, b_n \bar{Y}^2 \\
&= \frac{b_n}{\sqrt{a_n}} \sum_{k=2}^{a_n} \left\{ \bar{Y}_k^2 - E\left( \bar{Y}_k^2 \mid \mathcal{F}_{k-1,n} \right) + h\left( X_{(k-1)b_n} \right) - E_\pi h\left( X_{(k-1)b_n} \right) \right\} + \frac{b_n}{\sqrt{a_n}} \left( \bar{Y}_1^2 - E_\pi \bar{Y}_1^2 \right) - \sqrt{a_n} \, b_n \bar{Y}^2.
\end{aligned} \tag{2.2}$$
Here, for $1 \le k \le a_n$, $\mathcal{F}_{k,n}$ is the sigma-algebra generated by $X_1, \ldots, X_{k b_n}$, and
$$h\left( X_{(k-1)b_n} \right) := E\left( \bar{Y}_k^2 \mid \mathcal{F}_{k-1,n} \right) = E\left( \bar{Y}_k^2 \mid X_{(k-1)b_n} \right)$$
due to the Markovian structure of $(X_n)_{n \ge 1}$. Let $\tilde{h} = h - E_\pi h \in L_0^2(\pi)$. Since the Markov operator $K$ has operator norm $\lambda = \|K\| < 1$ (due to geometric ergodicity), it follows that $I - K$ is invertible (using, e.g., the expansion $(I - K)^{-1} = \sum_{j=0}^{\infty} K^j$). Therefore, $I - K^{b_n}$ is also invertible, since $K^{b_n}$ is also a Markov operator, with $\|K^{b_n}\| \le \lambda^{b_n} < 1$. Consequently, one can find a $\tilde{g}$ such that $\tilde{g} = (I - K^{b_n})^{-1} \tilde{h}$, i.e., $\tilde{h} = \tilde{g} - K^{b_n} \tilde{g}$. Then