arXiv:2106.06894v1 [math.ST] 13 Jun 2021

A LARGE DEVIATION APPROACH TO POSTERIOR CONSISTENCY IN DYNAMICAL SYSTEMS

LANGXUAN SU AND SAYAN MUKHERJEE

Abstract. In this paper, we provide asymptotic results concerning (generalized) Bayesian inference for certain dynamical systems based on a large deviation approach. Given a sequence of observations y, a class of model processes parameterized by θ ∈ Θ which can be characterized as a stochastic process X^θ or a measure µ_θ, and a loss function L which measures the error between y and a realization of X^θ, we specify the generalized posterior distribution π_t(θ | y). The goal of this paper is to study the asymptotic behavior of π_t(θ | y) as t → ∞. In particular, we state conditions on the model family {µ_θ}_{θ∈Θ} and the loss function L such that the posterior distribution converges. The two conditions we require are: (1) a conditional large deviation behavior for a single X^θ, and (2) an exponential continuity condition over the model family for the map from the parameter θ to the loss incurred between X^θ and the observation sequence y. We also show that the generalized posterior distribution concentrates asymptotically on those parameters that minimize the expected loss and a divergence term, hence proving posterior consistency. In addition, we apply the framework to two very different classes of dynamical systems: continuous time hypermixing processes and Gibbs processes on shifts of finite type.

Acknowledgments. LS and SM would like to thank Kevin McGoff, Andrew Nobel, and Jonathan C. Mattingly for discussions. The authors acknowledge support from National Science Foundation grants DEB-1840223, DMS 17-13012, DMS 16-13261, and CCF-1934964, National Institutes of Health grant R01 DK116187-01, Human Frontier Science Program grant RGP0051/2017, as well as SAMSI. LS dedicates this to his father Weibo Su for his endurance.

1. Introduction

In this work, we provide asymptotic results for (generalized) Bayesian inference for certain dynamical systems based on a large deviation approach. The central contribution of this work is a flexible proof framework that can be applied to a variety of systems, including both discrete and continuous time stochastic processes and dynamical systems with continuous state space. In generalized Bayesian inference, we only require the existence of a loss function that can be used to assess how well a sequence generated by a model process fits the observed data. In contrast, Bayesian inference requires knowledge of the likelihood or the
We consider gen- eralized Bayes procedures for two reasons: (1) the large deviation approach to Bayesian inference will focus on properties of the loss function, and (2) when studying dynamical or other complex systems, it may be unrealistic to assume knowledge of the data generating system and difficult to verify that the data generating process is in the model class specified. The elements of our model include a sequence of observations y, a class of model processes parameterized by θ ∈ Θ which can be characterized as a θ X or a measure µθ, and a loss function L which measures the error between y and a realization of Xθ. These elements together allow us to specify the generalized posterior distribution πt(θ | y). The goal of this paper is to study the asymptotic behavior of πt(θ | y) as t → ∞. In particular, we state conditions on the model family {µθ}θ∈Θ and the loss function L such that the posterior distribution converges. The two conditions we require are: (1) a conditional large deviation behavior for a single Xθ, and (2) an exponential continuity condition over the model family for the map from the parameter θ to the loss incurred between Xθ and the observation sequence y. The motivations for proposing the large deviation framework developed in this paper are simplicity and generality. We simplify proving posterior consistency to checking for two properties which have been well studied in the large deviations literature. As the proposed framework can be applied to a wide class of processes, we are providing new proof techniques that can be used to analyze posterior consistency for novel stochastic models. In this paper, we provide some evidence of the generality of our procedure via applications to continuous time hypermixing processes and Gibbs processes on shifts of finite type. The point is that the same procedure for proving posterior consistency can be used for two very different dynamical systems.

1.1. Connections to previous work. Large deviations have been applied to Bayesian procedures including posterior distributions of i.i.d. random variables [30], Markov chains [28,51], Dirichlet processes [31], and empirical likelihoods [34]. These papers do not specifically consider posterior consistency and do not provide a general methodology for proving posterior consistency, rather they focus on deriving the large deviation for a particular Bayesian procedure.

Posterior consistency for i.i.d. processes is well studied, going back to initial results by Doob [23] and Schwartz [56]. Counterexamples and challenges to proving posterior consistency for nonparametric models were highlighted by Diaconis and Freedman in [17]. Posterior consistency for Bayesian nonparametric models is an active area of research, for a detailed review see [33].

There are far fewer results for posterior consistency for dependent processes or dynamical systems. Posterior consistency for hidden Markov models was established in [13, 24, 25, 32, 61]. For dynamical systems, [45] established posterior consistency of (hidden) Gibbs processes on mixing subshifts of finite type using properties of Gibbs measures, and [43] directly used large deviation results to prove a similar result.

The setting of our paper is similar to [58], where the author establishes posterior consistency by proving the large deviation principle for the posterior distribution of some dependent processes with likelihood ratios instead of loss functions. There are technical difficulties in verifying the main assumptions in [58], in particular when the state space is neither finite nor compact. In addition, [58] does not provide any procedures to check the required assumptions. Our work can be seen as a general framework for verifying those assumptions. Additionally, the methods used in both our paper and [58] are similar to the -based argument in [46], which originates from [24].

From a large deviations perspective, we will make extensive use of two techniques in large deviations theory. We will require large deviation behavior on each model process conditioning on a given observation y. In the large deviation literature, this is closely related to the conditional large deviation principle or the quenched large deviation principle which has been studied for a variety of stochastic models [11, 14, 57]. We will also require a regularity condition on the map from parameters to model processes called exponential continuity, which is adapted from the work on the large deviation of mixtures or the annealed large deviation principle. This formalism has been widely studied for a variety of processes [3, 14, 18, 64], with exchangeable sequences the most notable from the literature [18,64]. The idea of exponential continuity we use comes from [18], and it is also implicit in the study of uniform large deviation principles of Markov processes with different initial conditions in [6, 26].

The idea of a variational formulation of Bayesian inference was developed by Zellner [66], and the link between maximum entropy and Bayesian inference was at the heart of the inference framework advocated by Edwin T. Jaynes [37]. In the variational perspective Bayesian inference is an optimization procedure over distributions that minimizes an error term and a regularization term that enforces the posterior to be close to the prior. Generalized Bayes inference refers to procedures for updating prior beliefs where the loss may not necessarily be the log-likelihood and the regularization term need not be the relative entropy to the prior. The idea of using loss functions to update beliefs goes back at least to Vovk [62]. It has played a central role in the PAC-Bayesian approach to statistical learning [8,44] and has been adopted by the mainstream Bayesian community [4,48]. In [35] consistency and rates of convergence are obtained for generalized Bayesian methods.

1.2. Overview. In Section 2, we provide notations, the setup of our prob- lem and the main results of the paper. In Section 3, we outline our large deviation framework to prove posterior consistency and provide both details and examples for the steps required and the various mathematical quan- tities involved. In Section 4, we apply our large deviation framework to two very different dynamic models: continuous time hypermixing stochas- tic processes, and Gibbs processes on shifts of finite type. We close with a discussion. We also provide appendices for deferred proofs, background material and auxiliary technical results.

2. Observed System and Model Processes

Our inference framework consists of two components. The first component is the dynamical system that generated the observations. The second component is a parameterized family of stochastic processes for which (generalized) Bayesian updating provides a posterior distribution conditioning on the observations. We will denote T as the time index. In this work, we consider both discrete T = N time indices, equipped with the discrete topology, and continuous T = [0, ∞) time indices, equipped with the Euclidean topology. In both cases, T is equipped with its Borel σ-algebra.

2.1. Observed System. Let Y be a Polish space equipped with the Borel σ-algebra. We denote a one-parameter family of measurable transformations acting on Y by (Ψ^t)_{t∈T}, i.e., for each t ∈ T, Ψ^t : Y → Y is a measurable map, the map T × Y → Y, (t, y) ↦ Ψ^t y is jointly measurable, and Ψ^{t+s} = Ψ^t ◦ Ψ^s. We recall a few definitions from ergodic theory. A Borel probability measure ν on Y is said to be Ψ-invariant if ν((Ψ^t)^{−1}E) = ν(E) for any t ∈ T and any measurable set E ⊂ Y. A set E ⊂ Y is said to be Ψ-invariant if (Ψ^t)^{−1}(E) = E for any t ∈ T. A Ψ-invariant Borel probability measure ν is called Ψ-ergodic if ν(E) ∈ {0, 1} for any Ψ-invariant set E ⊂ Y. A function f : Y → [−∞, ∞] is called Ψ-invariant if for ν-almost every y ∈ Y, f(y) = f(Ψ^t y) for any t ∈ T. In the following, we assume that observations come from a dynamical system (Y, Ψ, ν) where ν is Ψ-ergodic. In the continuous time case T = [0, ∞), we make an additional assumption that (Ψ^t)_{t∈T} is a one-parameter family of continuous transformations so that the map (t, y) ↦ Ψ^t(y) is continuous. This ensures the integrals in time in this paper are all well defined.

2.2. Parameterized Model Processes. As we consider Bayesian inference in this paper, our objective is to obtain a posterior distribution over a parameterized family of processes which quantifies the evidence that each of the model processes in the family could have generated the observed data. We specify the family of processes as (X^θ_t)_{t∈T} indexed by parameter θ ∈ Θ and our inferential goal is to obtain a posterior distribution π_t(θ | y) given observations y up to time t. Before we rigorously state the form of the posterior, we formally specify the general form of the model processes we consider in this paper.

Let S be a Polish space. Let X = C(T, S) be the space of continuous paths x : t ↦ x_t on S equipped with the topology of uniform convergence on compact intervals of T, so X is also a Polish space under this topology with a compatible metric d_X. The natural one-parameter family of continuous transformations acting on X is the family of time shift maps (Φ^t)_{t∈T} given by Φ^t : X → X, (Φ^t(x))_s = x_{t+s}. Thus, X can be viewed as the canonical probability space of continuous S-valued stochastic processes.

Let Θ be a compact metric space of parameters with the metric denoted by d_Θ. One can view the family of model processes as a family of Φ-invariant measures {µ_θ}_{θ∈Θ} on X, perhaps a more dynamical perspective. A more probabilistic view considers a parameterized family of S-valued stationary stochastic processes {X^θ = (X^θ_t)_{t∈T}}_{θ∈Θ} with corresponding laws {µ_θ}_{θ∈Θ} on X. Both perspectives will be considered in this paper. From the probabilistic view point, we have the freedom to choose the underlying probability space (Ω, F, P) of the stochastic process X^θ.

2.3. Posterior Inference. In Bayesian inference, the quantity of interest is the posterior distribution on the parameters given our observation sequence y up to time t,

π_t(θ | y) = Lik^t(y | θ) × π_0(θ) / ∫_{θ′∈Θ} Lik^t(y | θ′) × π_0(θ′) dθ′,

where π_0(θ) is the prior distribution or the belief over the parameters θ ∈ Θ, Lik^t(y | θ) is the likelihood of observing the data y given parameter θ up to time t, and for the purpose of this paper the denominator is a normalizing constant (in Bayesian inference it is the marginal probability of the data y). Another formulation of Bayesian inference focuses on the loss function l^t(y; θ) = − log Lik^t(y | θ) which measures the incurred error or loss on the data y given a parameter θ up to time t. The posterior in this setting is

π_t(θ | y) = exp(−l^t(y; θ)) × π_0(θ) / ∫_{θ′∈Θ} exp(−l^t(y; θ′)) × π_0(θ′) dθ′.

We will focus on the above loss-based formulation of Bayesian inference because it makes more transparent the connections between large deviation theory and posterior consistency, and it allows us to provide results for the wider class of generalized Bayesian updating procedures. We consider a general loss function L : Θ × X × Y → R, (θ, x, y) ↦ L_θ(x, y), where the path x ∈ X can be interpreted as the underlying state of the observed data. We define the (integrated) loss up to time t for the continuous and discrete cases as

L^t_θ(x, y) := ∫_0^t L_θ(Φ^s x, Ψ^s y) ds,   L^t_θ(x, y) := Σ_{s=0}^{t−1} L_θ(Φ^s x, Ψ^s y).
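To illustrate the loss-based update, the following is a minimal numerical sketch (not taken from the paper); the AR(1) data, the one-step prediction map, and the bounded per-step loss are all hypothetical choices used only to show how a generalized posterior π_t(θ | y) is formed on a finite parameter grid.

```python
import numpy as np

# Hypothetical illustration of the loss-based (generalized) posterior on a grid.
rng = np.random.default_rng(0)

thetas = np.linspace(-1.0, 1.0, 41)                  # discretized parameter space Theta
prior = np.full(thetas.size, 1.0 / thetas.size)      # fully supported prior pi_0

# Observed sequence y_0, ..., y_{t-1}: an AR(1) path with "true" parameter 0.3.
t, theta_star = 500, 0.3
y = np.zeros(t)
for s in range(1, t):
    y[s] = theta_star * y[s - 1] + 0.1 * rng.standard_normal()

def integrated_loss(theta, y):
    """Discrete-time integrated loss: sum over s of a bounded per-step loss
    between the one-step prediction theta * y_s and the observation y_{s+1}."""
    per_step = np.minimum((theta * y[:-1] - y[1:]) ** 2, 1.0)
    return per_step.sum()

log_weights = np.log(prior) - np.array([integrated_loss(th, y) for th in thetas])
log_weights -= log_weights.max()                     # stabilize before exponentiating
posterior = np.exp(log_weights)
posterior /= posterior.sum()                         # pi_t(theta | y) on the grid

print("posterior mode:", thetas[np.argmax(posterior)])
```

As t grows, the mass of this discrete posterior concentrates around the grid parameters with the smallest integrated loss, which is the phenomenon the paper analyzes for general model families.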

We now formally state the posterior distribution of interest in this paper. We assume the space of parameters Θ is equipped with its Borel σ-algebra and a fully supported prior measure π_0. Given an observation y, the posterior distribution is

(2.1)   π_t(E | y) = (1 / Z^{π_0}_t(y)) ∫_{E×X} exp(−L^t_θ(x, y)) dµ_θ(x) dπ_0(θ),

where E ⊆ Θ and the normalization constant or partition function is

(2.2)   Z^π_t(y) = ∫_{Θ×X} exp(−L^t_{θ′}(x, y)) dµ_{θ′}(x) dπ(θ′),

where π is any Borel probability measure on Θ. Again, when L_θ(x, y) is the negative log likelihood of y given x and θ, one recovers Bayesian inference. The goal of this work is to study the asymptotic behavior of the posterior distribution π_t(· | y) as t → ∞. In particular, we want to understand under what conditions on the model family {µ_θ}_{θ∈Θ} and the loss function L the posterior distribution π_t(· | y) converges, and around what set of parameters it concentrates asymptotically.

2.4. Main Results. We develop a large deviation framework to study the asymptotic behavior of the posterior distribution π_t(· | y). We introduce the notion of an exponentially continuous family for describing the regularity of the parametrization map θ ↦ µ_θ and prove the following variational characterization of the exponential asymptotics of the partition function.

Theorem 2.1. Suppose {µ_θ}_{θ∈Θ} is an exponentially continuous family with respect to the loss function L. Then there exists a continuous function V : Θ → R such that for any Borel probability measure π on Θ, for ν-almost every y ∈ Y,

lim_{t→∞} −(1/t) log Z^π_t(y) = inf_{θ∈supp(π)} V(θ).

As a corollary from [45, Theorem 2], we obtain the convergence of the posterior distribution π_t(· | y) as t → ∞ and characterize the set it concentrates on asymptotically.

Corollary 2.1.1. Let Θmin be the set of minimizers of V in Θ. Then if U is an open neighborhood of Θmin, for ν-almost every y ∈Y, we have

lim_{t→∞} π_t(Θ \ U | y) = 0.

To generalize previous results on posterior consistency and demonstrate the flexibility of our framework, we provide natural sufficient conditions of exponential continuity for two classes of very different model processes: (1) a class of continuous-time, dependent processes on general state spaces called hypermixing processes; (2) a class of discrete-time, discrete-valued dynamical systems called Gibbs processes on shifts of finite type. For these two classes of examples, we also characterize V explicitly so that the posterior distribution concentrates asymptotically to those parameters that minimize the sum of the expected loss and a divergence term, thus proving posterior consistency.

2.5. Notation. Suppose Z is a Polish space equipped with the Borel σ-algebra and (R^t)_{t∈T} is a one-parameter family of (continuous if T = [0, ∞)) transformations on Z. The following objects appear throughout the paper.

Measures: Let M(Z) denote the vector space of finite Borel measures on Z, M_1(Z) the subset of Borel probability measures on Z, and M_1(Z, R) the subset of R-invariant Borel probability measures. The topology we consider is the weak topology. In particular, M_1(Z) is a Polish space under this topology. Let M_ν(X × Y) denote the set of Borel probability measures on X × Y with Y-marginal ν. For z ∈ Z, we write the empirical measure of z on Z in continuous and discrete time as

M_t(z) := (1/t) ∫_0^t δ_{R^s z} ds,   M_t(z) := (1/t) Σ_{s=0}^{t−1} δ_{R^s z},

where δ_z is the Dirac delta measure at z.
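As a quick illustration of the discrete-time empirical measure above, the following minimal sketch (not from the paper) computes M_t(z) for a hypothetical transformation R on a small finite set.

```python
from collections import Counter

# Hypothetical illustration of M_t(z) = (1/t) * sum_{s=0}^{t-1} delta_{R^s z}.
def empirical_measure(z0, R, t):
    """Return M_t(z0) as a dict mapping each visited state to its occupation frequency."""
    orbit, z = [], z0
    for _ in range(t):
        orbit.append(z)
        z = R(z)
    return {state: count / t for state, count in Counter(orbit).items()}

R = lambda z: (z + 1) % 5            # a simple transformation on {0, 1, 2, 3, 4}
print(empirical_measure(0, R, 12))   # occupation frequencies along the orbit
```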

Sigma algebras: Consider X and Y as canonical probability spaces for the stochastic process X = (X_t)_{t∈T} and the observation Y (defined by X_t : x ↦ x_t and Y : y ↦ y). We define the σ-algebra F_t = σ(X_s : 0 ≤ s ≤ t) on X as generated by X up to time t, and F̃_t = σ(X_s, Y : 0 ≤ s ≤ t) on X × Y as generated by X up to time t with full information of Y.

Functions: Let C_b(Z) denote the set of bounded continuous functions f : Z → R, C_{b,loc}(X) the set of functions f ∈ C_b(X) such that f is F_r-measurable for some r ≥ 0, and C_{b,loc}(X × Y) the set of functions f ∈ C_b(X × Y) such that f is F̃_r-measurable for some r ≥ 0. For a continuous function f : Z → R and interval I ⊂ T, we write in continuous time and discrete time, respectively,

f^I(z) := ∫_I f(R^s z) ds,   f^I(z) := Σ_{s∈I} f(R^s z),

and we write f^t := f^{[0,t]}, f^t := f^{{0,1,...,t−1}}, respectively. We denote the sup norm of f by ‖f‖.

Joint processes: Let J (Φ : ν) denote the set of (Φ×Ψ)-invariant probability measures with Y-marginal ν, and Je(Φ : ν) the (Φ × Ψ)-ergodic elements of J (Φ : ν).

3. A Large Deviation Approach

In this section, we present a framework of proving posterior convergence using a large deviation approach. We start with the observation in [45] that there is a variational characterization that implies posterior consistency. Starting with this observation, the large deviation approach to posterior consistency is based on showing the following:

(1) a conditional large deviation behavior of some empirical process on X × Y; this allows us to prove the variational characterization for a single model process;
(2) an exponential continuity condition over the model family, adapted from the large deviation theory of mixtures; this allows us to prove the variational characterization over the entire model family.

3.1. A Variational Formulation for Posterior Convergence. A central idea in [45] was that proving posterior consistency can be reduced to proving a variational characterization of the asymptotics of the normalizing constant Z^π_t at an exponential scale. The results in [45] were for a specific model family, Gibbs processes on mixing subshifts of finite type. A key point in this paper is that the variational characterization of the asymptotics of the normalizing constant at an exponential scale will hold for a wide class of processes, and this characterization can be used to prove posterior consistency. We start by restating Theorems 1 and 2 in [45] using the notation for model processes in our paper.

Theorem 3.1. Suppose there exists a lower semicontinuous function V : Θ → R such that for any Borel probability measure π on Θ, for ν-almost every y ∈ Y,

(3.1)   lim_{t→∞} −(1/t) log Z^π_t(y) = inf_{θ∈supp(π)} V(θ).

Let Θmin be the set of minimizers of V in Θ. Then if U is an open neigh- borhood of Θmin, for ν-almost every y ∈Y, we have

lim_{t→∞} π_t(Θ \ U | y) = 0.

In other words, to show posterior consistency, it suffices to prove (3.1) for any probability measure π on Θ. Moreover, Theorem 3.1 reveals that the generalized posterior distribution concentrates asymptotically on minimizers of the function V. A crucial observation in [45] was that V can be explicitly stated when the model processes are Gibbs processes on mixing subshifts of finite type:

(3.2)   V(θ) = inf_{λ∈J(Φ:ν)} { ∫ L_θ dλ + h(λ ‖ µ_θ ⊗ ν) },

where L_θ is the map (x, y) ↦ L_θ(x, y), J(Φ : ν) are the joint processes defined before, and h(λ ‖ µ_θ ⊗ ν) measures the divergence of λ from µ_θ ⊗ ν, defined explicitly later in this paper. Intuitively, when a parameter θ is given without further information of the dynamics on X and Y, the natural guess for the joint distribution on X and Y is the independent coupling µ_θ ⊗ ν.

From this perspective, we may interpret the parameters in Θ_min. We provide a short proof of Theorem 3.1 in Appendix A.1. The proof is very similar to the proof of Theorem 2 in [45], but we include it to provide a proof that uses our notation and to be self-contained.

3.2. The Process-level LDP and the Relative Entropy Rate. In this subsection, we provide a variational characterization of the partition function for a single parameter θ as a direct application of large deviation theory. More specifically, we state the partition function specified in (2.2) for a fixed parameter θ ∈ Θ (consider π = δ_θ on Θ):

(3.3)   Z^θ_t(y) := Z^{δ_θ}_t(y) = ∫_X exp(−L^t_θ(x, y)) dµ_θ(x).

Then the variational characterization (3.1) for a fixed θ reduces to

(3.4)   lim_{t→∞} (1/t) log Z^θ_t(y) = −V(θ),

where V(θ) has the form of (3.2). We will show that (3.4) is implied by some large deviation principle. The large deviation principle quantifies asymptotic behavior of probability measures on an exponential scale through the rate function via the following definition.

Definition 3.1 (Large Deviation Principle (LDP)). Let Z be a Polish space and I : Z → [0, ∞] be a lower semicontinuous function. A family (η_t)_{t∈T} of probability measures on Z is said to satisfy the large deviation principle with rate function I if for every closed set E ⊂ Z,

(3.5)   lim sup_{t→∞} (1/t) log η_t(E) ≤ −inf_{z∈E} I(z),

and for every open set U ⊂ Z,

(3.6)   lim inf_{t→∞} (1/t) log η_t(U) ≥ −inf_{z∈U} I(z).

In particular, we say that a stochastic process (Z_t)_{t∈T} on Z satisfies the large deviation principle with rate function I if the laws of (Z_t)_{t∈T} on Z do. Note that the variational characterization of the scaled log partition function in equation (3.1) takes a similar form as the above LDP. The connection is even more explicit when we consider Varadhan's Lemma [6, Theorem 1.5], which states the following equivalent definition of the large deviation principle.

Theorem 3.2 (Varadhan's Lemma). Let Z be a Polish space and I : Z → [0, ∞] be a lower semicontinuous function. If a stochastic process (Z_t)_{t∈T} on Z satisfies the large deviation principle with rate function I, then it satisfies the Laplace principle, i.e., for all F ∈ C_b(Z), we have

(3.7)   lim_{t→∞} (1/t) log E[exp(−t F(Z_t))] = −inf_{z∈Z} (F(z) + I(z)).
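As a concrete point of reference (a standard textbook instance, not part of the original text), Cramér's theorem for coin flips gives an explicit rate function to which Definition 3.1 and Theorem 3.2 can be applied:

```latex
% Standard illustration (not from this paper): Cramer's theorem for Bernoulli
% variables, together with the Laplace-principle form of Varadhan's Lemma.
Let $X_1, X_2, \dots$ be i.i.d.\ Bernoulli($p$) with $p \in (0,1)$ and let
$Z_t = \tfrac{1}{t}\sum_{s=1}^{t} X_s$.  Then $(Z_t)_{t \in \mathbb{N}}$ satisfies
the large deviation principle on $[0,1]$ with rate function
\[
  I(z) = z \log\frac{z}{p} + (1-z)\log\frac{1-z}{1-p},
\]
and Varadhan's Lemma yields, for every $F \in C_b([0,1])$,
\[
  \lim_{t\to\infty} \frac{1}{t}\log \mathbb{E}\bigl[\exp(-t F(Z_t))\bigr]
  = -\inf_{z\in[0,1]} \bigl(F(z) + I(z)\bigr).
\]
```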

In fact, the large deviation upper bound (3.5) implies the “ ≤ ” part of (3.7), while the lower bound (3.6) implies the “ ≥ ” part of (3.7). Varadhan’s Lemma states that the large deviation principle implies the Laplace principle. Actually, the large deviation principle and the Laplace principle are equivalent if the rate function I has compact sublevel sets, see [6, Theorem 1.8]. We now specify a collection of stochastic processes for which the varia- tional characterization (3.4) of the single process partition function can be stated as a direct application of Varadhan’s Lemma. We set Z = M1(X×Y) and the stochastic process (Zt)t∈T of probability measures on (X×Y, Φ×Ψ) as

Z_t = M_t(X^θ, y) := (1/t) ∫_0^t δ_{(Φ^s X^θ, Ψ^s y)} ds   or   (1/t) Σ_{s=0}^{t−1} δ_{(Φ^s X^θ, Ψ^s y)}

for a fixed observation y ∈ Y and the model process (X^θ_t)_{t∈T} ∼ µ_θ corresponding to a fixed parameter θ ∈ Θ. Next, we pick a specific function F that is defined by

(3.8)   F : M_1(X × Y) → R,   η ↦ ∫_{X×Y} L_θ(x, y) dη(x, y),

where θ is the same fixed parameter. By plugging in the Z_t and F into the left-hand side of equation (3.7), we recover the partition function

E[exp(−t F(Z_t))] = ∫_X exp(−L^t_θ(x, y)) dµ_θ(x) = Z^θ_t(y),

since t F(M_t(x, y)) = ∫_0^t L_θ(Φ^s x, Ψ^s y) ds = L^t_θ(x, y). Hence, if (M_t(X^θ, y))_{t∈T} satisfies the large deviation principle on M_1(X × Y) with rate function I_θ : M_1(X × Y) → [0, ∞] given by

I_θ(λ) = h(λ ‖ µ_θ ⊗ ν) if λ ∈ J(Φ : ν), and I_θ(λ) = ∞ otherwise,

then by Varadhan's Lemma, we obtain the desired variational characterization for a fixed θ:

lim_{t→∞} (1/t) log Z^θ_t(y) = −inf_{λ∈J(Φ:ν)} { ∫ L_θ dλ + h(λ ‖ µ_θ ⊗ ν) } = −V(θ).

In the remainder of this subsection, we define rigorously the divergence term h(· ‖ ·) with examples for some stochastic processes and dynamical systems.

Remark 3.1. The large deviation principle is stronger than (3.4), since (3.4) only requires (3.7) to hold for the one selected function (3.8) rather than for all functions in C_b(Z). We will exploit this fact: to prove (3.4), it is sufficient to establish part of the large deviation principle instead of the full large deviation principle.

We now outline the role of the divergence term as the rate function. We need to make sense of the divergence term h(· ‖ µ ⊗ ν), where we set µ = µ_θ for a fixed parameter θ. We start with the definition of relative entropy, which quantifies the divergence between measures.

Definition 3.2. Let Z be a Polish space. For λ, η ∈ M(Z), the relative entropy, also called the Kullback-Leibler divergence, of λ with respect to η is defined as

K(λ ‖ η) = ∫ log(dλ/dη) dλ if λ ≪ η, and K(λ ‖ η) = ∞ otherwise,

where λ ≪ η means that λ is absolutely continuous with respect to η, and dλ/dη is the corresponding Radon-Nikodym derivative.

The relative entropy plays a universal role in large deviation theory. See [6, 26] for derivations of many well-known LDPs using the relative entropy. For the discrete time case, Sanov's Theorem [29] establishes the LDP for empirical measures ((1/t) Σ_{s=0}^{t−1} δ_{X_s})_{t∈N} of independent and identically distributed random variables X_0, X_1, ..., X_n ∼ µ_0 on S with rate function I(η) = K(η ‖ µ_0). In the case of empirical measures of discrete-time Markov chains [19–21], the rate function will be based on a relative entropy term involving both the Markov transition kernel as well as the initial stationary distribution. As the processes we are interested in can have arbitrary dependencies, and the dynamics can be deterministic, we will study the large deviation behavior of the whole empirical processes

M_t(X) = ((1/t) Σ_{s=0}^{t−1} δ_{Φ^s X})_{t∈N} = ((1/t) Σ_{s=0}^{t−1} δ_{(X_s, X_{s+1}, X_{s+2}, ...)})_{t∈N},

so we need to consider the role of the relative entropy on the whole process as a dynamical system. As discussed in [45], the relative entropy cannot be directly applied to measures on dynamical systems: as any two different ergodic measures are mutually singular, the relative entropy will be infinite. We are in need of a more useful divergence for measures on the path space and on dynamical systems.

Thus, we introduce the relative entropy rate on X, which usually appears as the rate function of process-level LDPs, such as the LDP of the empirical process M_t(X) on X for well-behaved stochastic processes (X_t)_{t∈T}. The relative entropy rate quantifies the divergence between measures on the path space in a meaningful way.

Definition 3.3 (The relative entropy rate on X). For λ, η ∈ M(X), we define the relative entropy rate of λ with respect to η as

h(λ ‖ η) := lim_{t→∞} (1/t) K(λ|_{F_t} ‖ η|_{F_t})

if the limit exists, where λ|_{F_t} and η|_{F_t} are λ and η restricted on F_t.
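To make Definition 3.3 concrete, the following minimal numerical sketch (not from the paper) verifies on a two-point alphabet that for i.i.d. product measures the restricted relative entropy grows linearly in t, so dividing by t recovers the one-step divergence K(η_0 ‖ µ_0) that appears in Example 3.1 below; the marginals are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical marginals of two i.i.d. processes on S = {0, 1}.
mu0 = np.array([0.5, 0.5])
eta0 = np.array([0.8, 0.2])

def kl(a, b):
    """Relative entropy K(a || b) of two probability vectors."""
    return float(np.sum(np.where(a > 0, a * np.log(a / b), 0.0)))

for t in (1, 5, 10):
    paths = list(itertools.product([0, 1], repeat=t))      # S^t, enumerated explicitly
    eta_t = np.array([np.prod([eta0[s] for s in w]) for w in paths])
    mu_t = np.array([np.prod([mu0[s] for s in w]) for w in paths])
    print(t, kl(eta_t, mu_t) / t, "vs", kl(eta0, mu0))      # the ratio is constant in t
```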

For many well-behaved stochastic processes (X_t)_{t∈T} with some law µ on X, the relative entropy rate h(· ‖ µ) is a well-defined, nontrivial function on M_1(X), and the empirical process M_t(X) satisfies an LDP with rate function I(λ) = h(λ ‖ µ); see for instance [12, 29, 54, 57].

To introduce the relative entropy rate on X × Y, we adapt [57, equation (3.7)], which states the process-level conditional LDP of Markov chains in a random environment, analogous to our observation y.

Definition 3.4 (The relative entropy rate on X × Y). For λ, η ∈ M(X × Y), we define the relative entropy rate of λ with respect to η as

h(λ ‖ η) := lim_{t→∞} (1/t) K(λ|_{F̃_t} ‖ η|_{F̃_t})

if the limit exists.

Now we are in a good position to restate the first step of our strategy: to show that for X^θ ∼ µ_θ with any fixed θ, it holds for ν-almost every y ∈ Y that (M_t(X^θ, y))_{t∈T} satisfies the large deviation principle on M_1(X × Y) with well-defined rate function I_θ(η) = h(η ‖ µ_θ ⊗ ν), or a weaker result that is sufficient to guarantee (3.4). The previous statement can be interpreted in the following way: we have hope of proving posterior consistency as long as our model family {µ_θ}_{θ∈Θ} corresponds to processes with good large deviation behavior.

3.2.1. Examples of the Relative Entropy Rate. For concreteness, we include a few examples where the relative entropy rate on X is a closed form expression. The second and third examples below come from [27, Appendix A], where computations are shown in full detail. The last example comes from [9, Example 1].

Example 3.1 (I.I.D. processes). Consider two discrete-time i.i.d. processes (X_t)_{t∈N}, (Z_t)_{t∈N} on S, where for each t ∈ N, X_t ∼ µ_0 and Z_t ∼ η_0 in S, respectively, and the whole processes (X_t)_{t∈N}, (Z_t)_{t∈N} have laws µ = µ_0^{⊗N}, η = η_0^{⊗N} on the path space X = S^N, respectively. Then the relative entropy rate of η with respect to µ is h(η ‖ µ) = K(η_0 ‖ µ_0).

Example 3.2 (Discrete-time Markov chains). Let (X_t)_{t∈N}, (Z_t)_{t∈N} be two Markov chains on S with transition kernels p and q, and stationary initial measures µ_0 and η_0 on S, respectively. Let µ, η be the laws of (X_t)_{t∈N}, (Z_t)_{t∈N} on the path space X = S^N, respectively. Then the relative entropy rate of η with respect to µ is h(η ‖ µ) = K(η_0 ⊗ q ‖ µ_0 ⊗ p), where µ_0 ⊗ p is the probability measure on S × S defined by (µ_0 ⊗ p)(A × B) = ∫_A p(x, B) dµ_0(x).

Example 3.3 (Stochastic differential equations). Consider two Itô diffusion processes on R^d:

dXt = a(Xt)dt + σ(Xt)dWt,

dZ_t = b(Z_t) dt + σ(Z_t) dW_t,

where a, b are vector fields on R^d and σ(x) ∈ R^{d×d} is non-singular, so that each of the stochastic differential equations above has a unique global weak solution, given stationary initial conditions X_0 ∼ µ_0 and Z_0 ∼ η_0. Let µ, η be the laws of (X_t)_{t∈[0,∞)}, (Z_t)_{t∈[0,∞)} on the path space X = C([0, ∞), R^d), respectively. Under some assumptions for Novikov's condition, one can show the relative entropy rate of η with respect to µ is

h(η ‖ µ) = (1/2) E_{x∼η_0}[ ‖a − b‖²_{Σ^{−1}} ],

where Σ := σσ^T is the diffusion matrix and ‖b(x)‖²_{Σ^{−1}(x)} := Σ_{i,j=1}^d b_i(x) Σ^{−1}_{ij}(x) b_j(x).

Example 3.4 (Shifts of finite type). Let T = N and S be a finite alphabet. Let σ_0 ∈ M_1(S) be the uniform probability measure, and σ = σ_0^{⊗T} ∈ M_1(X) be the product measure on X. Let η ∈ M_1(X, Φ) be an arbitrary Φ-invariant measure. Then the relative entropy rate of η with respect to σ is h(η ‖ σ) = h(σ) − h(η) = log |S| − h(η), where h(η) denotes the measure-theoretic or Kolmogorov-Sinai entropy of η.
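The Markov chain case of Example 3.2 is also easy to explore numerically. The following minimal sketch (not code from the paper) computes the expected one-step divergence E_{x∼η_0}[K(q(x,·) ‖ p(x,·))] for two stationary finite-state chains, a standard closed form for the relative entropy rate between stationary Markov chains; the transition matrices below are hypothetical.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a transition matrix P (left eigenvector for eigenvalue 1)."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    return v / v.sum()

def relative_entropy_rate(q, p):
    """E_{x ~ eta0}[ sum_y q(x,y) * log(q(x,y) / p(x,y)) ] with eta0 stationary for q."""
    eta0 = stationary(q)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log(q / p), 0.0)
    return float(eta0 @ terms.sum(axis=1))

p = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # kernel of the model chain (law mu)
q = np.array([[0.7, 0.3],
              [0.3, 0.7]])          # kernel of the alternative chain (law eta)
print(relative_entropy_rate(q, p))  # strictly positive, and zero exactly when q == p
```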

Remark 3.2 (The relative entropy rate in ergodic theory). From the er- godic theory point of view, one may view the σ-algebras Ft as measurable partitions consisting of cylinder sets of the path space X . From this perspec- tive, the relative entropy rate on many dynamical systems can be related to other well-studied quantities in ergodic theory, such as the Kolmogorov- Sinai entropy (as in Example 3.4), Lyapunov exponents, and equilibrium states. See Section 3.5.3 for a review of related work.

3.3. The Exponentially Continuous Model Family and the Loss Function. So far, we have a strategy for proving the variational characterization (3.4) for a single parameter θ using large deviation techniques. It remains to push (3.4) to the variational characterization (3.1) on the support of any probability measure π on the parameter space Θ. In this case, we need to study the large deviation behavior of the mixture ∫ µ_θ dπ(θ) of probability measures {µ_θ}_{θ∈Θ} mixing by π. Intuitively, to recover (3.1) based on the previous reasoning using Varadhan's Lemma, we want the process-level LDP to hold for the empirical process of ∫ µ_θ dπ(θ) with rate function inf_{θ∈supp(π)} I_θ, where I_θ is the rate function for the process-level LDP of µ_θ. This observation is reminiscent of the work on large deviations of mixtures [3, 18, 64]. The key property required for large deviations of mixtures is exponential continuity of the parametrization map θ ↦ µ_θ. The intuition for exponential continuity is that if θ_t → θ, then {µ_{θ_t}}_t is an exponential approximation of µ_θ in the language of large deviation theory [15, Section

4.2.2] so that {µθt } has the same large deviation asymptotics as µθ as t →∞. In our case, we consider exponential approximations of the empirical pro- cesses. We adapt the definition of exponential continuity to our setting.

Definition 3.5 (Exponentially continuous family). We say {µ_θ}_{θ∈Θ} is an exponentially continuous family with respect to L if the following holds: for all θ ∈ Θ, it holds for ν-a.e. y ∈ Y that the following limit exists

(3.9)   lim_{t→∞} (1/t) log ∫_X exp(−L^t_θ(x, y)) dµ_θ(x) =: −V(θ),

and if (θ_t)_{t∈T} is a family of parameters such that θ_t → θ in Θ, then

(3.10)   lim_{t→∞} (1/t) log ∫_X exp(−L^t_{θ_t}(x, y)) dµ_{θ_t}(x) = −V(θ).

For clarity of exposition, we focus on the loss function L with the following assumption. Weaker assumptions are possible but will complicate theorem statements and are not as natural when the state space S is non-compact.

Assumption 3.1 (The loss function). We assume that the loss function L : Θ × X × Y → R, (θ, x, y) ↦ L_θ(x, y) satisfies the following properties:

(1) For all θ ∈ Θ, the map L_θ : (x, y) ↦ L_θ(x, y) belongs to C_{b,loc}(X × Y).
(2) For any θ, θ′ ∈ Θ, x, x′ ∈ X and y ∈ Y,

|L_θ(x, y) − L_{θ′}(x′, y)| ≤ ω(d_Θ(θ, θ′)) + ω(d_X(x, x′))

for some function ω : [0, ∞) → [0, ∞) with limr→0+ ω(r)= ω(0) = 0, which can be viewed as the modulus of continuity. The natural class of loss functions we are considering is the following.

Example 3.5. Let d̄_Y ∈ C_b(Y²) be a bounded metric and ϕ : Θ × X → Y, (θ, x) ↦ ϕ_θ(x) be a uniformly continuous function such that for every θ ∈ Θ, the map ϕ_θ : X → Y, x ↦ ϕ_θ(x) is F_{r(θ)}-measurable for some r(θ) > 0. Then the loss function L defined by L_θ(x, y) := d̄_Y(ϕ_θ(x), y) satisfies Assumption 3.1.

Remark 3.3. Note that every metric is equivalent to a bounded metric. The assumption that L_θ belongs to C_{b,loc}(X × Y) is practical, since if the loss L_θ(x, y) depends on the whole path of x, then it is more difficult for real-world computations.

As an intermediate step of proving exponential continuity, it is convenient to show that given a fixed θ ∈ Θ, for ν-a.e. y ∈ Y, if (θ_t)_{t∈T} is a sequence in Θ such that θ_t → θ, then there exists a probability space (Ω, F, P) such that for all ǫ > 0,

(3.11)   lim sup_{t→∞} (1/t) log P( (1/t)(L^t_θ(X^{θ_t}, y) − L^t_θ(X^θ, y)) > ǫ ) = −∞.

The expression (3.11) is basically the same as the definition of exponential approximations [15, Definition 4.2.10]. To show (3.11), we will see that it requires the continuity of the map θ ↦ µ_θ with respect to the topology on Θ and the weak topology of measures, and this continuity requirement may impose nontrivial restrictions on the family of dynamics.

Once the exponential continuity of {µθ}θ∈Θ is established, the continuity of V and the variational characterization (3.1) needed for posterior con- sistency follow immediately. In particular, posterior consistency holds by Theorem 3.1.

Lemma 3.3. If {µθ}θ∈Θ is an exponentially continuous family with respect to the loss function L, then V is a continuous function.

Proposition 3.4. Suppose {µ_θ}_{θ∈Θ} is an exponentially continuous family with respect to the loss function L and π is a Borel probability measure on Θ. Then for ν-almost every y ∈ Y,

lim_{t→∞} −(1/t) log Z^π_t(y) = inf_{θ∈supp(π)} V(θ).

We defer the proofs to Appendix A.2 as they adapt known arguments.

Remark 3.4 (Continuity of V). Although Theorem 3.1 only requires lower semicontinuity of V, the continuity of V is actually of significance. In [45], the authors only required lower semicontinuity on V as the focus was on the dependency of the divergence term I_θ(λ) = h(λ ‖ µ_θ ⊗ ν) on θ, which is generally only lower semicontinuous. However, if one considers both terms in V, inf_λ(F(λ) + I_θ(λ)), the expression is usually continuous in θ. This is an advantage of the Laplace principle mentioned in Theorem 3.2 compared to the usual LDP, which is exploited for proving uniform LDPs in [6, Proposition 1.12]. The continuity of V resolves a subtle discrepancy between the analogous posterior consistency results of [58] and [45]. More specifically, the posterior distributions in [58] converge to the essential minimizers (those that achieve the π_0-essential infimum) of V instead of the actual minimizers of V, while there is no difference if V is continuous.

3.4. A Summary of the Strategy. We summarize our strategy of proving posterior consistency for a given model family {µ_θ}_{θ∈Θ} and a loss function L. The guiding principle is that if each member µ_θ of the model family has good large deviation behavior, such as satisfying the process-level LDP, then we have hope to prove posterior consistency. The strategy consists of the following two steps:

(1) Extend the LDP of M_t(X^θ) to the (conditional) LDP of M_t(X^θ, y) for ν-a.e. y ∈ Y so that the following limit exists:

lim_{t→∞} (1/t) log ∫_X exp(−L^t_θ(x, y)) dµ_θ(x) =: −V(θ).

(2) Prove exponential continuity (Definition 3.5) of the map θ ↦ µ_θ with respect to the loss function L.

Then posterior consistency follows from Lemma 3.3, Proposition 3.4 and Theorem 3.1. These steps can be taken in a generic way, meaning we can use any variant of large deviation behaviors that exist to guarantee the needed variational characterization (3.1). In particular, we will apply a standard approach of directly proving the large deviation behavior and exponential continuity in the case of hypermixing processes, while we will reduce the case of Gibbs processes on shifts of finite type to known results for i.i.d. processes by exploiting the large deviation structure. These two different approaches highlight the flexibility of our framework.

3.5. Related Work. We now summarize previous work relating large de- viation theory and posterior consistency to our framework.

3.5.1. Large Deviations in Bayesian Inference. In [30], the authors prove the large deviation principle for posterior distributions of i.i.d. random variables with finite values as an inverse of Sanov's Theorem. In [28, 51], large deviations results are obtained for posterior distributions of Markov chains with a finite state space. In [31], a large deviation result is proved for Dirichlet posteriors. In [34], the authors give a probabilistic justification of empirical likelihoods through large deviations.

3.5.2. Large Deviation Theory. A huge and growing body of literature is devoted to the study of large deviations, and we refer to the textbooks [6, 15, 29, 54] for general introductions. In particular, the weak convergence approach based on relative entropy in [6, 26] has a strong influence on our understanding of the large deviation phenomenon in this work. We survey some of the literature that is most relevant.

Large deviation theory for i.i.d. processes can be found in the book [29]. Donsker and Varadhan established large deviations of certain Markov processes in a series of works [19–21], and the extension to hidden Markov models is established in [36]. Large deviations for some mixing stochastic processes are studied in [5, 12], and those for Gaussian processes are studied in [22]. In particular, we will apply our framework to processes satisfying the hypermixing condition in [12]. Besides stochastic processes, large deviation properties are also shown in [38, 50, 55, 65] for many well-behaved dynamical systems, such as Gibbs processes on mixing subshifts of finite type, Axiom A systems, and some non-uniform hyperbolic systems. We point out that following our strategy, one may adapt the proof for the large deviation behavior of Gibbs states in [65] to obtain the posterior consistency result in [45] for Axiom A systems.

The large deviation principle conditioning on a given observation y is referred to as the conditional large deviation principle or the quenched large deviation principle. Quenched large deviation principles are widely studied in various settings, including irreducible Markov chains with random transitions [57], random walks in a random environment [14], finite state Gibbs random fields [11] and some mixing processes [10]. As we pass from a single parameter to a mixture over the parameter space, we move to what is called the large deviation of mixtures or the annealed large deviation principle. This formalism has been widely studied for exchangeable sequences [18, 64], random walks in a random environment [14], and other stochastic models [3]. The idea of exponential continuity we use appears in [18] and is implicit in the study of uniform large deviation principles of Markov processes with different initial conditions in [6, 26]; here the "parameter" is the initial condition of the process.

Large deviation techniques extend to infinite dimensional processes [7] and random fields, i.e., processes indexed by lattices [54]. This highlights the potential applicability of our framework to a very rich class of models. There are also connections of various ideas with statistical mechanics, as presented in [29, 54, 60].

3.5.3. The Relative Entropy Rate in Ergodic Theory. There are connections between the relative entropy rate and other notions of entropy in ergodic theory, such as the Kolmogorov-Sinai entropy [63] of a dynamical system. One can either compare the limit definitions with other notions of entropy, using the variational characterization [6, Lemma 2.4.(e)] of the relative entropy over partitions of the space, or identify the relative entropy rate with the unique large deviation rate function of a well-behaved dynamical system. In the case of mixing shifts of finite type—including finite-state Markov chains or the Bernoulli shift—one obtains explicit characterizations of the relative entropy rate of Gibbs measures in terms of the pressure and the corresponding potential functions, see [9, 38, 49, 65] and the exposition in [54, Chapter 8.2]. For continuous space Markov chains as in [19–21], the relative entropy rate has an explicit expression in terms of the transition kernel and the stationary initial distribution, also called the Donsker-Varadhan entropy. Moreover, the divergence term in [47] is the conditional version of the relative entropy, replacing the usual Kolmogorov-Sinai entropy with the fibre entropy of a random dynamical system [39, 40]. In the case of Axiom A systems, the relative entropy rate of Riemannian or SRB measures is characterized by Lyapunov exponents, see [38, 65]. Also, [42] provides a more general perspective.

Finally, we point out a recent application of the relative entropy rate in sensitivity analysis and uncertainty quantification [27], where relative entropy rates of some Markov chains and some stochastic differential equations are computed explicitly as examples.

4. Example Processes

One motivation for using large deviations to prove posterior consistency is to have a flexible framework that can be applied to a variety of processes. This is in contrast to having to design a new proof technique to show posterior consistency for each new class of stochastic processes or dynamical systems. In this section, we consider two very different process models: continuous time hypermixing stochastic processes, and Gibbs processes on shifts of finite type. Our large deviations framework can be used to prove posterior consistency for both settings.

4.1. Hypermixing Processes. Most posterior consistency results make at least one of the following assumptions on the model processes: (1) it is a discrete time process; (2) the state space S is compact or discrete; (3) the dependency structure is simple, such as Markov chains; (4) the dynamics have a particular structure, such as Gaussian processes. We are motivated to prove posterior consistency for a general class of continuous time, dependent processes on general state spaces with good large deviation behavior.

Hypermixing processes refer to a class of stochastic processes with certain strong mixing properties. In this section, we apply our framework to prove posterior consistency when the model family is a collection of hypermixing processes. We will focus on the continuous-time horizon T = [0, ∞), though the results can be stated and proved analogously for discrete-time processes.

We start with relevant definitions. Given a closed interval I ⊂ T, we denote the σ-algebra F_I = σ(X_t : t ∈ I). The following definitions come from [16, Section 5.4].

Definition 4.1. Given ℓ > 0, n ≥ 2, and real-valued functions f_1, ..., f_n on X, we say that f_1, ..., f_n are ℓ-measurably separated if there exist intervals I_1, ..., I_n such that dist(I_m, I_{m′}) ≥ ℓ for 1 ≤ m < m′ ≤ n and f_m is F_{I_m}-measurable for each 1 ≤ m ≤ n.

Definition 4.2. We say that µ ∈ M_1(X, Φ) is hypermixing if there exist a number ℓ_0 ≥ 0 and non-increasing α, β : (ℓ_0, ∞) → [1, ∞) and γ : (ℓ_0, ∞) → [0, 1] which satisfy

(4.1)   lim_{ℓ→∞} α(ℓ) = 1,   lim sup_{ℓ→∞} ℓ(β(ℓ) − 1) < ∞,   lim_{ℓ→∞} γ(ℓ) = 0,

and for which

(H-1)   ‖f_1 ⋯ f_n‖_{L^1(µ)} ≤ ∏_{k=1}^n ‖f_k‖_{L^{α(ℓ)}(µ)}

whenever n ≥ 2, ℓ > ℓ_0, and f_1, ..., f_n are ℓ-measurably separated functions, and

(H-2)   | ∫_X fg dµ − ∫_X f dµ ∫_X g dµ | ≤ γ(ℓ) ‖f‖_{L^{β(ℓ)}(µ)} ‖g‖_{L^{β(ℓ)}(µ)}

whenever ℓ > ℓ_0 and f, g ∈ L^1(µ) are ℓ-measurably separated.

In this section, we denote the law of a hypermixing process by µ. The main large deviation result in [12] states the following theorem.

Theorem 4.1. The empirical process M_t(X) of a hypermixing process X ∼ µ satisfies the large deviation principle with rate function

I(λ) = h(λ ‖ µ) if λ ∈ M_1(X, Φ), and I(λ) = ∞ otherwise.

Remark 4.1. We will see that property (H-1) of hypermixing processes is already enough for posterior consistency. Remark 4.2. In [5], a process with only a slight modification of property (H-2) satisfies the LDP, but the rate function may not coincide with the relative entropy rate. Properties of the relative entropy rate are crucial for our approach. 4.1.1. Concrete Examples of Hypermixing Processes. We now present some concrete examples of stochastic processes that are hypermixing. Example 4.1 (I.I.D. processes). Consider the discrete time case T = N. Let (Xt)t∈T be an S-valued i.i.d process with law µ on X , then µ is hypermixing.

Example 4.2 (Markov processes). Let (X_t)_{t∈T} be an S-valued Markov process which can be realized on X = C(T, S), and we denote the associated Markov semigroup acting on bounded measurable functions on S by (P_t)_{t∈T}. Let µ_0 ∈ M_1(S) be an invariant measure (in the sense of Markov processes) with respect to (P_t)_{t∈T}, and µ ∈ M_1(X, Φ) be the corresponding Φ-invariant measure on the path space X. Then according to [16, Theorem 6.1.3], µ is hypermixing if and only if the operator norm of P_T : L^2(µ_0) → L^4(µ_0) is one for some T > 0, i.e., the Markov semigroup (P_t)_{t∈T} is µ_0-hypercontractive.

Remark 4.3. Hypercontractivity is equivalent to many other well studied properties such as the logarithmic Sobolev inequality, which can be verified in a wide class of examples, such as hypoelliptic diffusions on compact manifolds. See [16, Chapter VI] for more details and [2] for a general introduction.

Remark 4.4. More examples of hypermixing processes can be found in [12], including ǫ-Markov processes (i.e., generalized Markov processes where the current state depends on the past up to a time interval of length ǫ) and some Gaussian processes with certain decay in their spectral density matrices.

4.1.2. The Structure of the Large Deviation Proof–Hypermixing Case. We apply the strategy outlined in Section 3.4 to the case of hypermixing processes. We also detail the components of the proof. First, we work on one hypermixing process (X_t)_{t∈T} ∼ µ. The immediate goal is to show for any f ∈ C_{b,loc}(X × Y), it holds for ν-a.e. y ∈ Y that

lim_{t→∞} (1/t) log ∫_X exp(f^t(x, y)) dµ(x) = sup_{λ∈J(Φ:ν)} { ∫ f dλ − h(λ ‖ µ ⊗ ν) }.

We will establish this in a series of steps:

(1) We show that given f ∈ C_{b,loc}(X × Y), for ν-a.e. y ∈ Y, the limit

p(f) := lim sup_{t→∞} (1/t) log ∫_X exp(f^t(x, y)) dµ(x)

exists and does not depend on y. In particular, p : C_{b,loc}(X × Y) → R is called the pressure functional.

(2) We show that

p*(λ) = h(λ ‖ µ ⊗ ν) if λ ∈ J(Φ : ν), and p*(λ) = ∞ otherwise,

where p* : M_1(X × Y) → R is the Fenchel-Legendre transform of p, i.e.,

p*(λ) = sup{ ∫ f dλ − p(f) : f ∈ C_{b,loc}(X × Y) }.

This establishes the existence of the relative entropy rate.

(3) As suggested by Theorem 3.2, we show the upper bound

lim sup_{t→∞} (1/t) log ∫_X exp(f^t(x, y)) dµ(x) ≤ sup_{λ∈J(Φ:ν)} { ∫ f dλ − h(λ ‖ µ ⊗ ν) }

by proving the large deviation upper bound of (M_t(X, y))_{t∈T}.

(4) We show the lower bound

lim inf_{t→∞} (1/t) log ∫_X exp(f^t(x, y)) dµ(x) ≥ sup_{λ∈J(Φ:ν)} { ∫ f dλ − h(λ ‖ µ ⊗ ν) }

by exploiting the properties of the relative entropy rate and linearity of integration on measures.

Next, we need to show that the parameterized family of processes is exponentially continuous to obtain posterior consistency. Towards this end, we introduce the notion of a regular family {µ_θ}_{θ∈Θ} of hypermixing processes. We show that when {µ_θ}_{θ∈Θ} is regular, it is exponentially continuous with respect to the loss function L.

4.1.3. The Pressure Functional. The main result of this subsection is exis- tence of the pressure functional p. Once p is established, the existence and properties of the relative entropy rate will hold immediately, and we will obtain the large deviation upper bound in the following subsections.

Theorem 4.2. If f ∈ C_{b,loc}(X × Y), then the following limits exist, and the equalities hold:

(4.2)   p(f) := lim_{t→∞} (1/t) ∫_Y log( ∫_X exp(f^t(x, y)) dµ(x) ) dν(y)
             = ∫_Y lim sup_{t→∞} (1/t) log( ∫_X exp(f^t(x, y)) dµ(x) ) dν(y)
             = lim sup_{t→∞} (1/t) log ∫_X exp(f^t(x, y)) dµ(x)   for y ∈ E,

where E ⊆ Y is a set of full ν-measure that is independent of f. The proof combines the ideas from [12] and [11, Lemma 2].

Proof. Let f ∈ C_{b,loc}(X × Y). Then f is F̃_r-measurable for some r ≥ 0. Let ℓ > ℓ_0 and r′(ℓ) = ℓ + r. For y ∈ Y, write

p_t(f, y) = (1/t) log ∫_X exp(f^t(x, y)) dµ(x).

Fix s, t ∈ T, t > s. The idea is to divide [0, t] into smaller intervals of length s that are r′(ℓ)-separated to apply (H-1) on the smaller intervals. For u ∈ T, let J_u^{s,t} be the collection of disjoint intervals of the form

u + (s + r′(ℓ))m + [0,s] ⊂ [0,t], m ∈ N,

and Ī_u^{s,t} = [0, t] \ ⋃_{I∈J_u^{s,t}} I. Then for u ∈ [0, s + r′(ℓ)], by properties of f and (H-1),

(4.3)   p_t(f, y) ≤ (1/t) log ∫_X exp( |Ī_u^{s,t}| ‖f‖ + Σ_{I∈J_u^{s,t}} f^I(x, y) ) dµ(x)
               = (‖f‖/t) |Ī_u^{s,t}| + (1/t) log ∫_X exp( Σ_{I∈J_u^{s,t}} f^I(x, y) ) dµ(x)
               ≤ (‖f‖/t) |Ī_u^{s,t}| + (1/(t α(ℓ))) Σ_{I∈J_u^{s,t}} log ∫_X exp( α(ℓ) f^I(x, y) ) dµ(x),

where |Ī_u^{s,t}| denotes the total length (the Lebesgue measure) of Ī_u^{s,t}. Next, we average the inequality (4.3) over u to obtain a tighter estimate, which is given by the following lemma, whose proof is deferred to the end of this subsection.

Lemma 4.3. We have for any t>s,

(4.4) pt(f,y) ≤ As,t + Bs,t(y) where

lim_{s→∞} lim_{t→∞} A_{s,t} = 0,

B_{s,t}(y) := (1/(t α(ℓ))) · (s/(s + r′(ℓ))) ∫_0^{t−s} p_s(α(ℓ) f, Ψ^u y) du.

Remark 4.5. The idea of Lemma 4.3 is to observe

(4.5)   ⋃_{u∈[0, s+r′(ℓ)]} J_u^{s,t} = { u + [0, s] : u ∈ [0, t − s] }.

An examination of the time intervals in the integrals on the right-hand side of (4.3) allows us to formally show, up to a vanishing remainder,

∫_0^{s+r′(ℓ)} Σ_{I∈J_u^{s,t}} log ∫_X exp( α(ℓ) f^I(x, y) ) dµ(x) du = ∫_0^{t−s} log ∫_X exp( α(ℓ) f^{u+[0,s]}(x, y) ) dµ(x) du,

and apply Φ-invariance of µ. The above equality is exact and straightforward in discrete time, where integrals in time are replaced by summations, and the equality follows by rearranging the summations using (4.5). For continuous time, we pass to the limit of the Riemann integrals.

(Proof of Theorem 4.2, Continued.) From (4.3) and Lemma 4.3, we have

pt(f,y) ≤ As,t + Bs,t(y). Integrate with respect to ν on both sides and obtain

∫_Y p_t(f, y) dν(y) ≤ A_{s,t} + ∫_Y B_{s,t}(y) dν(y),

where by the Ψ-invariance of ν,

∫_Y B_{s,t}(y) dν(y) = ((t − s)/(t α(ℓ))) · (s/(s + r′(ℓ))) ∫_Y p_s(α(ℓ) f, y) dν(y)
                     ≤ ((t − s)/t) · (s/(s + r′(ℓ))) [ ∫_Y p_s(f, y) dν(y) + (α(ℓ) − 1) ‖f‖ ],

and we used the bound

(4.6)   (1/α(ℓ)) p_s(α(ℓ) f, y) ≤ p_s(f, y) + (α(ℓ) − 1) ‖f‖.

By taking t → ∞, s → ∞ and ℓ → ∞ successively, we obtain

lim sup_{t→∞} ∫_Y p_t(f, y) dν(y) ≤ lim inf_{s→∞} ∫_Y p_s(f, y) dν(y).

Therefore, the limit p(f) exists. Recall from (4.4) that

p_t(f, y) ≤ A_{s,t} + (1/(t α(ℓ))) · (s/(s + r′(ℓ))) ∫_0^{t−s} p_s(α(ℓ) f, Ψ^u y) du.

By taking t → ∞, we apply Birkhoff's pointwise ergodic theorem (Theorem B.5) and (4.6) on the right-hand side above and obtain, for ν-a.e. y ∈ Y independent of f,

lim sup_{t→∞} p_t(f, y) ≤ lim_{t→∞} A_{s,t} + (s/(s + r′(ℓ))) [ ∫_Y p_s(f, y) dν(y) + (α(ℓ) − 1) ‖f‖ ].

By taking s → ∞ and then ℓ → ∞, we have lim sup_{t→∞} p_t(f, y) ≤ p(f). By Fatou's lemma, we also have

p(f) ≤ ∫_Y lim sup_{t→∞} p_t(f, y) dν(y).

Thus, it must be that lim sup_{t→∞} p_t(f, y) = p(f) for ν-a.e. y ∈ Y. □

Remark 4.6. When T = N and µ is the law of an i.i.d. process, the lim sup in Theorem 4.2 can be replaced by the exact limit.

To complete the argument, we present the proof of Lemma 4.3.

Proof of Lemma 4.3. Note that the number of intervals in J_u^{s,t} is at least ⌊(t + r′(ℓ))/(s + r′(ℓ))⌋ − 1 for u ∈ [0, s + r′(ℓ)], so |Ī_u^{s,t}| ≤ t − ⌊(t − s)/(s + r′(ℓ))⌋ s. We write

F^I(y) = log ∫_X exp( α(ℓ) f^I(x, y) ) dµ(x).

Let 0 = u_0 < u_1 < ⋯ < u_m = s + r′(ℓ) be a partition of [0, s + r′(ℓ)] such that u_{i+1} − u_i = (s + r′(ℓ))/m. Then by ordering the left endpoints of intervals in ⋃_{i=0}^{m−1} J_{u_i}^{s,t}, we obtain a partition 0 = v_0^m ≤ v_1^m ≤ ⋯ ≤ v_{n−1}^m ≤ t − s of [0, t − s] by setting v_n^m = t − s, and we have max_j (v_{j+1}^m − v_j^m) ≤ (s + r′(ℓ))/m. Note that J_u^{s,t} has at most ⌊(t + r′(ℓ))/(s + r′(ℓ))⌋ intervals for any u ∈ [0, s + r′(ℓ)], so (⌊(t + r′(ℓ))/(s + r′(ℓ))⌋ − 1) m ≤ n ≤ ⌊(t + r′(ℓ))/(s + r′(ℓ))⌋ m. Consider the Riemann sum

((s + r′(ℓ))/m) Σ_{i=0}^{m−1} Σ_{I∈J_{u_i}^{s,t}} F^I(y) = ((s + r′(ℓ))/m) Σ_{j=0}^{n−1} F^{v_j^m + [0,s]}(y).

Note that

| ((s + r′(ℓ))/m) Σ_{j=0}^{n−1} F^{v_j^m + [0,s]}(y) − Σ_{j=0}^{n−1} (v_{j+1}^m − v_j^m) F^{v_j^m + [0,s]}(y) | ≤ s ‖f‖ | n (s + r′(ℓ))/m − (t − s) | = o(t),

where o(t) denotes a term with lim_{t→∞} o(t)/t = 0, and hence,

lim_{m→∞} ((s + r′(ℓ))/m) Σ_{i=0}^{m−1} Σ_{I∈J_{u_i}^{s,t}} F^I(y) ≤ o(t) + lim_{m→∞} Σ_{j=0}^{n−1} (v_{j+1}^m − v_j^m) F^{v_j^m + [0,s]}(y)
 = o(t) + ∫_0^{t−s} F^{u+[0,s]}(y) du.

Therefore, by averaging the right-hand side of the inequality (4.3),

p_t(f, y) ≤ A_{s,t} + (1/(t α(ℓ))) · (1/(s + r′(ℓ))) · ((s + r′(ℓ))/m) Σ_{i=0}^{m−1} Σ_{I∈J_{u_i}^{s,t}} F^I(y),

where

A_{s,t} = (‖f‖/t) ( t − ⌊(t − s)/(s + r′(ℓ))⌋ s ) + o(t)/t.

By taking m → ∞ and by the Φ-invariance of µ,

B_{s,t}(y) = (1/(t α(ℓ))) · (1/(s + r′(ℓ))) ∫_0^{t−s} log ∫_X exp( α(ℓ) f^{u+[0,s]}(x, y) ) dµ(x) du
           = (1/(t α(ℓ))) · (1/(s + r′(ℓ))) ∫_0^{t−s} log ∫_X exp( α(ℓ) f^{[0,s]}(x, Ψ^u y) ) dµ(x) du
           = (1/(t α(ℓ))) · (s/(s + r′(ℓ))) ∫_0^{t−s} p_s(α(ℓ) f, Ψ^u y) du. □

4.1.4. The Relative Entropy Rate. With the existence of the pressure func- tional p, the existence of the relative entropy rate can be established by a convex duality argument. We also prove properties of the relative entropy rate which will be crucial for proving the lower bound. Proposition 4.4. For λ ∈J (Φ : ν), h(λ k µ⊗ν) as defined by Definition 3.4 exists as h(λ k µ ⊗ ν) = p∗(λ), where p∗ is the Fenchel-Legendre transform of p, i.e.,

(4.7) p*(λ) = sup { ∫ f dλ − p(f) : f ∈ C_{b,loc}(X × Y) }.
In particular, we have for ν-a.e. y ∈ Y,
(4.8) lim_{t→∞} (1/t) K(λ_y|_{F_t} ‖ µ|_{F_t}) = h(λ ‖ µ ⊗ ν).
The proof of Proposition 4.4 requires the following two lemmas, whose proofs are deferred to Appendix A.3. The first one provides a useful estimate for obtaining p*. The second one establishes a generalized subadditive property for proving (4.8), which can be viewed as a subadditive ergodic result, and combined with Appendix C, it may be of independent interest for the ergodic theory of random dynamical systems.

Lemma 4.5. If f ∈ C_{b,loc}(X × Y) is F_r-measurable, then for ℓ > ℓ_0,
(4.9) p(f) ≤ (1/((ℓ+r)α(ℓ))) ∫_Y log ( ∫_X e^{(ℓ+r)α(ℓ) f(x,y)} dµ(x) ) dν(y).
Lemma 4.6. Let λ ∈ J(Φ : ν) with its disintegration λ = ∫ λ_y ⊗ δ_y dν(y) (see Theorem B.1). For t ≥ 0, define a function F_t : Y → [−∞, 0] by

F_t(y) = −K(λ_y|_{F_t} ‖ µ|_{F_t}). Fix ℓ > ℓ_0. Then for ν-a.e. y ∈ Y, any n ∈ N and t_1, ..., t_n ∈ [0, ∞),
F_{(n−1)ℓ + Σ_{k=1}^n t_k}(y) ≤ (1/α(ℓ)) Σ_{k=1}^n F_{t_k}( Ψ^{(k−1)ℓ + Σ_{j=1}^{k−1} t_j} y ).
Proof of Proposition 4.4. We start with the first statement. Fix r ≥ 0 and let f ∈ C_{b,loc}(X × Y) be an F_r-measurable function. By the definition of p*, Lemma 4.5 and Jensen’s inequality,
p*(λ) ≥ ∫ f/((ℓ+r)α(ℓ)) dλ − p( f/((ℓ+r)α(ℓ)) )
≥ (1/((ℓ+r)α(ℓ))) ( ∫ f dλ − ∫_Y log ( ∫_X e^{f(x,y)} dµ(x) ) dν(y) )
≥ (1/((ℓ+r)α(ℓ))) ( ∫ f dλ − log ∫ e^f d(µ ⊗ ν) ).
Taking the supremum over such f implies p*(λ) ≥ K(λ|_{F̃_r} ‖ µ ⊗ ν|_{F̃_r}) / ((ℓ+r)α(ℓ)), so by taking r → ∞, we obtain p*(λ) ≥ h(λ ‖ µ ⊗ ν).
To prove the other side of the inequality, note that by uniqueness of disintegration (Theorem B.1) and the chain rule of the relative entropy (Theorem B.3), we have

K(λ|_{F̃_t} ‖ µ ⊗ ν|_{F̃_t}) = K(ν ‖ ν) + ∫_Y K(λ_y|_{F_t} ‖ µ|_{F_t}) dν(y)

= ∫_Y K(λ_y|_{F_t} ‖ µ|_{F_t}) dν(y).
Now note that f^t is F_{t+r}-measurable. In particular, for every y ∈ Y, f^t(·, y) is F_{t+r}-measurable. Then by the variational property of relative entropy (Theorem B.2),

(4.10) K(λ_y|_{F_{t+r}} ‖ µ|_{F_{t+r}}) ≥ ∫ f^t dλ_y − log ∫_X exp( f^t(x,y) ) dµ(x).
Integrate both sides with respect to ν and obtain

K(λ|_{F̃_{t+r}} ‖ µ ⊗ ν|_{F̃_{t+r}}) ≥ ∫ f^t dλ − ∫_Y log ∫_X exp( f^t(x,y) ) dµ(x) dν(y)
= t ∫ f dλ − ∫_Y log ∫_X exp( f^t(x,y) ) dµ(x) dν(y).
Divide both sides by t, take t → ∞, and get

h(λ ‖ µ ⊗ ν) ≥ ∫ f dλ − p(f).
Therefore, h(λ ‖ µ ⊗ ν) ≥ p*(λ).
We show the last statement. For t ≥ 0, define a sequence of functions

F_t : Y → [−∞, 0] by F_t(y) = −K(λ_y|_{F_t} ‖ µ|_{F_t}). Based on Appendix C, since (C.1) holds by Lemma 4.6, Proposition C.4 gives that lim_{t→∞} F_t(y)/t exists for ν-a.e. y ∈ Y, and
lim_{t→∞} ∫_Y F_t(y)/t dν(y) = −h(λ ‖ µ ⊗ ν).
Since lim_{t→∞} F_t/t is a Ψ-invariant function (see [1]) and ν is Ψ-ergodic, it must hold for ν-a.e. y ∈ Y that
lim_{t→∞} (1/t) K(λ_y|_{F_t} ‖ µ|_{F_t}) = h(λ ‖ µ ⊗ ν). □
Remark 4.7. When T = N and µ is the law of an i.i.d. process, (4.8) follows directly from the subadditive ergodic theorem [1].
Once the existence of the relative entropy rate is established, the following properties are known consequences.
Proposition 4.7. The relative entropy rate h(· ‖ µ ⊗ ν) is a convex and lower semicontinuous function on J(Φ : ν). Moreover, h(· ‖ µ ⊗ ν) is affine (see Lemma B.9) on J(Φ : ν).
Proof. The proof is the same as that for the same properties of h(· ‖ µ) in [54, Proposition 6.8] and [16, (5.4.23)]. □
4.1.5. The Upper Bound. We now prove the following upper bound.

Proposition 4.8. Let f ∈ C_b(X × Y). Then for ν-a.e. y ∈ Y,
limsup_{t→∞} (1/t) log ∫_X exp( f^t(x,y) ) dµ(x) ≤ sup_{λ∈J(Φ:ν)} { ∫ f dλ − h(λ ‖ µ ⊗ ν) }.
According to Theorem 3.2, it suffices to show the large deviation upper bound (3.5) for the empirical process (M_t(X,y))_{t∈T} with X ∼ µ, for ν-a.e. y ∈ Y. We first establish the weak large deviation upper bound, which together with exponential tightness implies the large deviation upper bound, as a common procedure in large deviation theory. See [54, Theorem 2.19].
Lemma 4.9. For ν-a.e. y ∈ Y, it holds that for any compact set K ⊂ M_1(X × Y),
limsup_{t→∞} (1/t) log µ( M_t(·,y) ∈ K ) ≤ − inf_K p*,
where p* is the Fenchel-Legendre transform (also known as the convex conjugate) of p, i.e.,

p*(λ) := sup { ∫ f dλ − p(f) : f ∈ C_{b,loc}(X × Y) }.
In other words, for ν-a.e. y ∈ Y, the empirical process (M_t(X,y))_{t∈T} with X ∼ µ satisfies the weak large deviation upper bound with rate function p*. Moreover, p*(λ) = ∞ if λ is not (Φ × Ψ)-invariant, or λ does not have Y-marginal ν.

With Theorem 4.2 in place, the proof of Lemma 4.9 is standard in large deviation theory, so we defer it to Appendix D. Now we can prove the desired upper bound.
Proof of Proposition 4.8. By the first part of Lemma 4.9, we know that for ν-a.e. y ∈ Y, the empirical process (M_t(X,y))_{t∈T} with X ∼ µ satisfies the weak large deviation upper bound with rate function p*. Since (M_t(X))_{t∈T} is exponentially tight (implied by its LDP), by Lemma D.1, for ν-a.e. y ∈ Y, (M_t(X,y))_{t∈T} is also exponentially tight, and hence satisfies the large deviation upper bound (3.5). The statement follows from Theorem 3.2, Proposition 4.4 and the second part of Lemma 4.9. □
4.1.6. The Lower Bound. We now prove the following lower bound.

Proposition 4.10. Let f ∈ C_b(X × Y). Then for ν-a.e. y ∈ Y,
liminf_{t→∞} (1/t) log ∫_X exp( f^t(x,y) ) dµ(x) ≥ sup_{λ∈J(Φ:ν)} { ∫ f dλ − h(λ ‖ µ ⊗ ν) }.
Remark 4.8. When T = N and µ is the law of an i.i.d. process, the lower bound follows directly from the inequality p** ≤ p in convex duality, where p** is the convex conjugate of p*.
We apply a standard approach for proving lower bounds, see [54, Proof of Theorem 6.13] and [16, Lemma 5.4.21]. We will use the affine property of the relative entropy rate and of the integrated cost as a linear (hence affine) functional on measures, and see that it is enough to prove the case when λ is ergodic. We do not need to consider the full large deviation lower bound (3.6). Moreover, the proof holds for other processes as long as the corresponding relative entropy rate is an affine function on invariant measures. The following lemma is the main technical step.

Lemma 4.11. Let λ ∈ J_e(Φ : ν). Then for ν-a.e. y ∈ Y, it holds that for any open neighborhood U ⊂ M_1(X × Y) of λ,
liminf_{t→∞} (1/t) log µ( M_t(·,y) ∈ U ) ≥ −h(λ ‖ µ ⊗ ν).
Proof. Assume that h(λ ‖ µ ⊗ ν) < ∞. Then K(λ|_{F̃_t} ‖ µ ⊗ ν|_{F̃_t}) < ∞ for all t ≥ 0. Pick a countable sequence of times t_n → ∞. Consider the disintegration λ = ∫ λ_y ⊗ δ_y dν(y), see Theorem B.1. We can take a set E ⊂ Y of full ν-measure of y ∈ Y such that K(λ_y|_{F_{t_n}} ‖ µ|_{F_{t_n}}) < ∞. Since t_n → ∞, if y ∈ E, then K(λ_y|_{F_t} ‖ µ|_{F_t}) < ∞ for any t ≥ 0.
For y ∈ E, let g_t^y := dλ_y/dµ : X → [0, ∞] be the Radon-Nikodym derivative on F_t. Recall that M_1(X × Y) is a Polish space, which has a countable base. Let U be a base element containing λ. By Appendix B.1, U contains the subset

{ η ∈ M_1(X × Y) : | ∫ f_i dη − ∫ f_i dλ | < ǫ, i = 1, ..., m }

for some ǫ > 0 small enough and some f_1, f_2, ..., f_m ∈ C_{b,loc}(X × Y), which are F_r-measurable for r large enough. Without loss of generality, it suffices to consider U of this form. Then the event {M_t(·,y) ∈ U} is F_{t+r}-measurable for r large enough. Let A_t(y) = { x ∈ X : M_t(x,y) ∈ U, g_{t+r}^y(x) > 0 }. By change of measure and Jensen’s inequality,
(1/t) log µ( M_t(·,y) ∈ U )
≥ (1/t) log ∫_{A_t(y)} (g_{t+r}^y)^{−1}(x) dλ_y(x)
= (1/t) log λ_y(A_t(y)) + (1/t) log ( (1/λ_y(A_t(y))) ∫_{A_t(y)} (g_{t+r}^y)^{−1}(x) dλ_y(x) )
≥ (1/t) log λ_y(A_t(y)) − (1/(t λ_y(A_t(y)))) ∫_{A_t(y)} log g_{t+r}^y(x) dλ_y(x).
By using x log x ≥ −1/e for x > 0,

∫_{A_t(y)} log g_{t+r}^y(x) dλ_y(x) = ∫_X log g_{t+r}^y dλ_y − ∫_{A_t(y)^c} g_{t+r}^y log g_{t+r}^y dµ ≤ K(λ_y|_{F_{t+r}} ‖ µ|_{F_{t+r}}) + 1/e.
By Birkhoff’s pointwise ergodic theorem,

λ( { (x,y) ∈ X × Y : M_t(x,y) ∈ U } ) −→ 1.
Thus, there exists a set E′ ⊂ Y of full ν-measure such that if y ∈ E′,

λ_y(A_t(y)) = λ_y( M_t(·,y) ∈ U ) −→ 1.
Combining the above and (4.8), we have for y ∈ E ∩ E′,
liminf_{t→∞} (1/t) log µ( M_t(·,y) ∈ U ) ≥ − lim_{t→∞} (1/t) K(λ_y|_{F_t} ‖ µ|_{F_t}) = −h(λ ‖ µ ⊗ ν). □

Proof of Proposition 4.10. Note that by Proposition 4.7, the expression inside the supremum on the right-hand side is affine in λ. By Lemma B.9 and Proposition B.10, it suffices to prove that for ν-a.e. y ∈ Y,
(4.11) liminf_{t→∞} (1/t) log ∫_X exp( f^t(x,y) ) dµ(x) ≥ ∫ f dλ − h(λ ‖ µ ⊗ ν),
where λ ∈ J_e(Φ : ν) belongs to a maximizing sequence of the right-hand side. Fix ǫ > 0. Define the set

U = { η ∈ M_1(X × Y) : ∫ f dη > ∫ f dλ − ǫ },
which is an open neighborhood of λ. Then by Lemma 4.11, for ν-a.e. y ∈ Y,
liminf_{t→∞} (1/t) log ∫_X exp( f^t(x,y) ) dµ(x)
≥ liminf_{t→∞} (1/t) log ∫_{ {M_t(·,y) ∈ U} } exp( f^t(x,y) ) dµ(x)
≥ ∫ f dλ − ǫ + liminf_{t→∞} (1/t) log µ( M_t(·,y) ∈ U )
≥ ∫ f dλ − ǫ − h(λ ‖ µ ⊗ ν).
We can take a countable sequence of ǫ → 0 so that we still have a set of full ν-measure of such y ∈ Y for (4.11). □

4.1.7. Exponentially Continuous Family. Assume we have a model family {µ_θ}_{θ∈Θ} of hypermixing processes, and we define the function V : Θ → R by (3.2). We have shown that, given a fixed θ ∈ Θ, it holds for ν-a.e. y ∈ Y that

(4.12) lim_{t→∞} (1/t) log ∫_X exp( −L_θ^t(x,y) ) dµ_θ(x) = −V(θ),
as long as L_θ ∈ C_b(X × Y), by Proposition 4.8 and Proposition 4.10. Now we introduce a sufficient condition on the family of hypermixing processes for posterior consistency.

Definition 4.3. A family {µθ}θ∈Θ of hypermixing processes is called regular if

(1) the map θ ↦ µ_θ is continuous, and
(2) if (θ_t)_{t∈T} is a sequence in Θ with θ_t −→ θ, then there exist constants C > 0, ℓ_0 > 0, and non-increasing functions α_t : (ℓ_0, ∞) → [1, ∞) which

correspond to the α in (H-1) for µ_{θ_t}, such that sup_{t∈T} α_t(ℓ_0) ≤ C.

Definition 4.3 says that any family of hypermixing processes {µ_{θ_t}} indexed by a converging sequence of parameters (θ_t) does not have an arbitrarily slow decay of correlations. In the case of Markov processes, one can often compute the mixing coefficients α explicitly (see [16, Chapter VI]), so this condition can be verified.
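To illustrate condition (2) with a minimal example of our own (not taken from the text, and read under the uniform-bound interpretation of condition (2) stated above): if every µ_θ in the family satisfies (H-1) with one common mixing function α and one common ℓ_0, then for any sequence θ_t −→ θ one may take α_t = α for all t, and condition (2) holds with any constant C ≥ α(ℓ_0), since sup_{t∈T} α_t(ℓ_0) = α(ℓ_0) < ∞. In other words, families with a uniform mixing bound automatically satisfy condition (2), so such a family is regular as soon as condition (1) also holds.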

The main result of this subsection is the following proposition, which is the final step of proving posterior consistency on a regular family of hypermixing processes.
Proposition 4.12. Suppose L is a loss function satisfying Assumption 3.1, and {µ_θ}_{θ∈Θ} is a regular family of hypermixing processes. Then {µ_θ}_{θ∈Θ} is an exponentially continuous family with respect to the loss function L.
As mentioned in Section 3.3, the following lemma is a useful intermediate step for proving exponential continuity; the proof adapts an exponential approximation argument used in the proof of [64, Lemma 2.5].
Lemma 4.13. Suppose L is a loss function satisfying Assumption 3.1 and {µ_θ}_{θ∈Θ} is a regular family of hypermixing processes. If θ ∈ Θ, then it holds for every y ∈ Y that if (θ_t)_{t∈T} is a sequence in Θ such that θ_t −→ θ, then there exists a probability space (Ω, F, P) such that for all ǫ > 0,
(4.13) limsup_{t→∞} (1/t) log P( (1/t)( L_θ^t(X^{θ_t}, y) − L_θ^t(X^θ, y) ) > ǫ ) = −∞,
which implies, for ν-a.e. y ∈ Y,

(4.14) lim_{t→∞} (1/t) log ∫_X exp( −L_θ^t(x,y) ) dµ_{θ_t}(x) = −V(θ).
Proof. Fix θ ∈ Θ. Let {θ_t}_{t∈T} be a sequence in Θ such that θ_t −→ θ. For ease of notation, we write X^t := X^{θ_t}, X := X^θ, f := −L_θ, and
d_t(X^s, X, y) := (1/t)( f^t(X^s, y) − f^t(X, y) ).

We start with proving (4.13). By Skorohod’s Representation Theorem, there exists a probability space (Ω, F, P) on which (X^t, X) has law µ_{θ_t} ⊗ µ_θ and X^t −→ X as t → ∞, P-almost surely. Fix ǫ > 0 and N > 0. By the Markov-Chebyshev inequality,

limsup_{t→∞} (1/t) log P( d_t(X^t, X, y) > ǫ )
≤ −Nǫ + limsup_{t→∞} (1/t) log E[ exp( N t d_t(X^t, X, y) ) ].
It suffices to show
limsup_{t→∞} (1/t) log E[ exp( N t d_t(X^t, X, y) ) ] = 0.

By Assumption 4.11, f ∈ C_{b,loc}(X × Y). Also, it holds that (X_s^{θ_t}, X_s)_{s∈T} is hypermixing on (X^2, Φ × Φ) ≃ (C(T, S^2), Φ × Φ), for every t ∈ T. As in the proof of Theorem 4.2, we write
p_s(N, t, y) = (1/s) log E[ exp( N s d_s(X^t, X, y) ) ],
and following the same argument as Lemma 4.3, we have for fixed t > s, there exists ℓ > 0 such that
p_t(N, t, y) ≤ A_{s,t} + (1/t) · (s/(s + r + ℓ)) ∫_0^{t−s} p_s(CN, t, Ψ^u y) du,
where the constant C comes from Definition 4.3, and lim_{s→∞} lim_{t→∞} A_{s,t} = 0. By Assumption 4.11,
sup_y s d_s(X^t, X, y) ≤ ∫_0^s ( ω(d_Θ(θ_t, θ)) + ω( d_X(Φ^r X^t, Φ^r X) ) ) dr,
which goes to zero as t → ∞. Therefore, by Lebesgue dominated convergence, sup_u p_s(CN, t, Ψ^u y) −→ 0 as t → ∞. Then by taking t → ∞ and s → ∞ successively, we have

limsup_{t→∞} p_t(N, t, y) = 0,
as desired. The proof of (4.14) follows the same argument as [6, Theorem 1.17], so we defer it to Appendix A.3. □

Proof of Proposition 4.12. Fix θ ∈ Θ. Let {θ_t}_{t∈T} be a sequence in Θ such that θ_t −→ θ. By Lemma 4.13, take y ∈ Y from a set of full ν-measure such that (4.12) and (4.14) hold. Then
limsup_{t→∞} (1/t) log ∫_X exp( −L_{θ_t}^t(x,y) ) dµ_{θ_t}(x)
≤ limsup_{t→∞} (1/t) log ∫_X exp( −L_θ^t(x,y) + tω(d_Θ(θ_t,θ)) ) dµ_{θ_t}(x)
= limsup_{t→∞} ω(d_Θ(θ_t,θ)) + limsup_{t→∞} (1/t) log ∫_X exp( −L_θ^t(x,y) ) dµ_{θ_t}(x)
= −V(θ).
The other side of the inequality follows from replacing ω with −ω and limsup with liminf. □

4.2. Gibbs Processes on Shifts of Finite Type. The previous example allowed us to prove posterior consistency for some families of continuous time stochastic processes. We now show that the same framework can be used to prove posterior consistency of Gibbs processes on shifts of finite type. These are discrete time, discrete state dynamical systems that include finite state Markov chains with arbitrary order of dependencies. As posterior consistency was already proven in [45], we will be concise on some of the technical details. Our motivation for including this example is to highlight the flexibility of the large deviation framework in proving posterior consistency.
Consider T = N and let S be a finite alphabet endowed with the discrete topology. We endow X = S^N with the product topology and a compatible metric d_X, for example, d_X(x, x′) = 2^{−n(x,x′)}, where n(x,x′) is the infimum of those m such that x_m ≠ x′_m.
Definition 4.4. Given a Hölder continuous potential function φ : X → R, a Borel probability measure µ ∈ M_1(X, Φ) is the corresponding Gibbs measure if the following Gibbs property is satisfied: there exist constants P ∈ R and N > 0 such that for all x ∈ X and t ≥ 1,
(4.15) N^{−1} ≤ µ([x_0, x_1, ..., x_{t−1}]) / exp( −Pt + φ^t(x) ) ≤ N,
where we write [w] = { x̄ ∈ X : x̄_i = w_i, i = 0, ..., t−1 } for any w ∈ S^t. We call the process (X_t)_{t∈T} with law µ the Gibbs process corresponding to φ.
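As a concrete illustration of Definition 4.4 (a standard example written out by us, not quoted from [45]): let µ be the law of a stationary first-order Markov chain on S with strictly positive transition matrix Q and stationary distribution π, and take the locally constant (hence Hölder) potential φ(x) = log Q(x_0, x_1). Then
µ([x_0, ..., x_{t−1}]) = π(x_0) Π_{i=0}^{t−2} Q(x_i, x_{i+1}) and exp( φ^t(x) ) = Π_{i=0}^{t−1} Q(x_i, x_{i+1}),
so the ratio in (4.15) equals π(x_0)/Q(x_{t−1}, x_t), and (4.15) holds with P = 0 and, for instance, N = max{ 1/min_a π(a), max_a π(a)/min_{a,b} Q(a,b) }.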

For notational convenience, we focus on the case of the one-sided full shift, but one may also prove the same result for a two-sided mixing subshift as in [45]. Also, we refer back to [45] for concrete examples of Gibbs processes, which include finite state Markov chains with arbitrary order of dependencies.
4.2.1. Convergence for one Gibbs process. Recall that the first step is to prove (3.4) for a single Gibbs measure µ. Unlike the previous example, instead of directly proving the needed large deviation behavior for µ, we reduce the problem to the case of product measures, using ideas of exponential tilting. Let σ_0 ∈ M_1(S) be the uniform probability measure on the finite alphabet S, and let σ = σ_0^{⊗T} ∈ M_1(X) be the corresponding product measure on X, which is hypermixing.

Lemma 4.14. If E ∈ F_{t−1}, then

N^{−1} ≤ µ(E) / ( |S|^t ∫_E exp( −Pt + φ^t(x) ) dσ(x) ) ≤ N.
Proof. Let E ∈ F_{t−1}. Then 1_E only depends on the first t coordinates, so by conditioning on the first t coordinates and the Gibbs property (4.15),

µ(E) = Σ_{w∈S^t} µ([w]) 1_E(w)
≤ N Σ_{w∈S^t} min_{x∈[w]} 1_E(x) exp( −Pt + φ^t(x) )
≤ N Σ_{w∈S^t} (1/σ([w])) ∫_{[w]} 1_E(x) exp( −Pt + φ^t(x) ) dσ(x)
= N |S|^t ∫_E exp( −Pt + φ^t(x) ) dσ(x).
The other side of the inequality is proved similarly by replacing min with max. □

For convenience, we write ⟨η, f⟩ := ∫ f dη for η ∈ M_1(X × Y) and f ∈ C_{b,loc}(X × Y). Following the same idea in [54, Chapter 6], we denote the periodic empirical process by M̃_t(x,y) := M_t(x^{(t)}, y), where x^{(t)} ∈ X is given by x_s^{(t)} = x_s for 0 ≤ s ≤ t−1 and x_s^{(t)} = x_r if s = r mod t. For any f ∈ C_{b,loc}(X × Y) and any y ∈ Y,
(4.16) lim_{t→∞} sup_{x∈X} | ⟨M̃_t(x,y), f⟩ − ⟨M_t(x,y), f⟩ | = 0.
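To unpack the periodic extension with a small example of our own: for t = 3, x^{(3)} = (x_0, x_1, x_2, x_0, x_1, x_2, ...). Roughly speaking, if f is F_r-measurable, then f ∘ (Φ × Ψ)^s takes the same value at (x^{(t)}, y) and at (x, y) for 0 ≤ s ≤ t − 1 − r, so the two averages in (4.16) differ by at most on the order of r‖f‖/t, uniformly in x; this is the heuristic behind (4.16).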

The key point of introducing M̃_t(x,y) is that 1_{{M̃_t(·,y)∈E}} is F_{t−1}-measurable for any Borel set E ⊂ M_1(X × Y). Also, when X ∼ σ, using (4.16), we can prove the same large deviation result for M̃_t(X,y) as that for M_t(X,y) in the previous section. Hence, we can reduce the problem of proving (3.4) for the Gibbs measure µ to the product measure σ by Lemma 4.14:
lim_{t→∞} (1/t) log ∫_X exp( f^t(x,y) ) dµ(x)
= lim_{t→∞} (1/t) log ∫_X exp( t ⟨M_t(x,y), f⟩ ) dµ(x)
= lim_{t→∞} (1/t) log ∫_X exp( t ⟨M̃_t(x,y), f⟩ ) dµ(x)
= log |S| − P + lim_{t→∞} (1/t) log ∫_X exp( t ⟨M̃_t(x,y), f + φ⟩ ) dσ(x)
= log |S| − P + sup_{λ∈J(Φ:ν)} { ∫ (f + φ) dλ − h(λ ‖ σ ⊗ ν) }
for ν-a.e. y ∈ Y. One can check, as in Example 3.4, that h(λ ‖ σ ⊗ ν) = log |S| − h^ν(λ), where h^ν(λ) is the fibre entropy (see [40]) of λ with respect to ν. Therefore, we recover that for any f ∈ C_{b,loc}(X × Y), it holds that for ν-a.e. y ∈ Y,
lim_{t→∞} (1/t) log ∫_X exp( f^t(x,y) ) dµ(x) = sup_{λ∈J(Φ:ν)} { ∫ f dλ − P + ∫ φ dλ + h^ν(λ) }.
Although the above is already sufficient for our Assumption 3.1 on loss functions, we note that with a little more work we can recover the above equality for f ∈ C_b(X × Y).
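For the identity h(λ ‖ σ ⊗ ν) = log |S| − h^ν(λ) used above, here is a quick sanity check of our own, assuming (as in (4.8)) that the relative entropy rate is the limit of the normalized cylinder relative entropies and that the fibre entropy can be written as h^ν(λ) = lim_{t→∞} (1/t) ∫_Y H(λ_y|_{F_{t−1}}) dν(y), with H the Shannon entropy of the induced distribution on length-t cylinders: since σ assigns mass |S|^{−t} to each length-t cylinder,
K(λ_y|_{F_{t−1}} ‖ σ|_{F_{t−1}}) = Σ_{w∈S^t} λ_y([w]) log( λ_y([w]) |S|^t ) = t log |S| − H(λ_y|_{F_{t−1}}),
and dividing by t, integrating over ν, and letting t → ∞ gives the claimed identity.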

One can also prove that for λ ∈ J(Φ : ν),
h(λ ‖ µ ⊗ ν) = P − ∫ φ dλ − h^ν(λ)
by following similar ideas in [9] and [45, Lemma 7].
Remark 4.9 (Gibbs processes on general state spaces). One can establish the same convergence result for Gibbs processes on general state spaces by a similar exponential tilting argument (see [54, Chapter 8]), but the exponential continuity in this case may not be as straightforward.
4.2.2. Exponentially Continuous Family. We first recall the definition of a regular model family {µ_θ}_{θ∈Θ} in [45]. Recall that to obtain the main convergence result (3.1), it suffices to prove exponential continuity (Definition 3.5) of {µ_θ}_{θ∈Θ} with respect to the loss function L.

Definition 4.5. A family {φ_θ}_{θ∈Θ} of potential functions is regular if all the φ_θ : X → R are r-Hölder continuous for some r > 0 and the map θ ↦ φ_θ is continuous with respect to the usual Hölder norm ‖ · ‖_r. We denote the corresponding regular family of Gibbs measures by {µ_θ}_{θ∈Θ}.
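A minimal example of a regular family (our own illustration): take Θ = [a, b] ⊂ R and φ_θ := θψ for a fixed r-Hölder function ψ : X → R. Then ‖φ_θ − φ_{θ′}‖_r = |θ − θ′| ‖ψ‖_r, so θ ↦ φ_θ is Lipschitz continuous in the Hölder norm, and {φ_θ}_{θ∈Θ} is regular in the sense of Definition 4.5.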

When {φ_θ}_{θ∈Θ} is a regular family, the corresponding map θ ↦ µ_θ is continuous with respect to the weak topology on measures. Moreover, the maps from θ to the corresponding constants N_θ and P_θ in (4.15) are also continuous. As a result, by the compactness of Θ, we have a uniform Gibbs property: there exist a uniform constant N and a continuous map θ ↦ P_θ such that

(4.17) N^{−1} ≤ µ_θ( { x̄ ∈ X : x̄_i = x_i, i = 0, ..., t−1 } ) / exp( −P_θ t + φ_θ^t(x) ) ≤ N.
We have shown that given a fixed θ ∈ Θ, for ν-a.e. y ∈ Y,

lim_{t→∞} (1/t) log ∫_X exp( −L_θ^t(x,y) ) dµ_θ(x) = −V(θ),
as long as L_θ ∈ C_{b,loc}(X × Y), where

V(θ) := inf_{λ∈J(Φ:ν)} { ∫ L_θ dλ + P_θ − ∫ φ_θ dλ − h^ν(λ) }.

Finally, we show that {µ_θ}_{θ∈Θ} is exponentially continuous (Definition 3.5) with respect to the loss function L, by exploiting the uniform Gibbs property (4.17) in basically the same way as in [45].

Lemma 4.15. Suppose L is a loss function satisfying Assumption 3.1. Then for any ǫ > 0, there exists T > 0 such that for all θ,θ′ ∈ Θ, y ∈ Y, and t ≥ 1,

∫_X exp( −L_θ^t(x,y) ) dµ_θ(x) / ∫_X exp( −L_{θ′}^t(x′,y) ) dµ_{θ′}(x′) ≤ N^2 e^{ t( ω(d_Θ(θ,θ′)) + ǫ ) + (t+T)( |P_θ − P_{θ′}| + ‖φ_θ − φ_{θ′}‖ ) }.
Proof. Let θ, θ′ ∈ Θ and y ∈ Y. Note that by the uniform Gibbs property (4.17), for any t ≥ 1 and w ∈ S^t,

(4.18) µ_θ([w]) / µ_{θ′}([w]) ≤ N^2 e^{ t( |P_θ − P_{θ′}| + ‖φ_θ − φ_{θ′}‖ ) }.
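To spell out this step (a short check of our own): fix any x ∈ [w]. Applying the upper bound in (4.17) to θ and the lower bound to θ′ gives
µ_θ([w]) ≤ N exp( −P_θ t + φ_θ^t(x) ) and µ_{θ′}([w]) ≥ N^{−1} exp( −P_{θ′} t + φ_{θ′}^t(x) ),
and dividing the first bound by the second yields
µ_θ([w]) / µ_{θ′}([w]) ≤ N^2 exp( (P_{θ′} − P_θ) t + φ_θ^t(x) − φ_{θ′}^t(x) ) ≤ N^2 e^{ t( |P_θ − P_{θ′}| + ‖φ_θ − φ_{θ′}‖ ) },
which is (4.18).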

Also, by Assumption 3.1, we see that

(4.19) exp( −L_θ^t(x,y) ) / exp( −L_{θ′}^t(x′,y) ) ≤ e^{ t ω(d_Θ(θ,θ′)) + Σ_{s=0}^{t−1} ω( d_X(Φ^s x, Φ^s x′) ) }.

Take T > 0 such that for any x, x′ ∈ X with x_i = x′_i for 0 ≤ i ≤ T − 1, we have d_X(x, x′) < ǫ. Using (4.19) and (4.18), we obtain

∫_X exp( −L_θ^t(x,y) ) dµ_θ(x)
= Σ_{w∈S^{t+T}} µ_θ([w]) ∫_{[w]} exp( −L_θ^t(x,y) ) dµ_θ(x | [w])
≤ e^{ t( ω(d_Θ(θ,θ′)) + ǫ ) } Σ_{w∈S^{t+T}} µ_θ([w]) ∫_{[w]} exp( −L_{θ′}^t(x′,y) ) dµ_{θ′}(x′ | [w])
≤ N^2 e^{ t( ω(d_Θ(θ,θ′)) + ǫ ) + (t+T)( |P_θ − P_{θ′}| + ‖φ_θ − φ_{θ′}‖ ) }

· Σ_{w∈S^{t+T}} µ_{θ′}([w]) ∫_{[w]} exp( −L_{θ′}^t(x′,y) ) dµ_{θ′}(x′ | [w])
= N^2 e^{ t( ω(d_Θ(θ,θ′)) + ǫ ) + (t+T)( |P_θ − P_{θ′}| + ‖φ_θ − φ_{θ′}‖ ) }

· ∫_X exp( −L_{θ′}^t(x′,y) ) dµ_{θ′}(x′). □

Corollary 4.15.1. Suppose L is a loss function satisfying Assumption 3.1, and {µθ}θ∈Θ is a regular family of Gibbs processes. Then {µθ}θ∈Θ is an exponentially continuous family with respect to the loss function L. Proof. Fix θ ∈ Θ. Then for ν-a.e. y ∈Y,

lim_{t→∞} (1/t) log ∫_X exp( −L_θ^t(x,y) ) dµ_θ(x) = −V(θ).
Let y ∈ Y be as above, and let (θ_t)_{t∈T} be a sequence in Θ with θ_t −→ θ. By Lemma 4.15, for any ǫ > 0, there exists T > 0 such that

lim_{t→∞} | (1/t) log ∫_X exp( −L_{θ_t}^t(x,y) ) dµ_{θ_t}(x) − ( −V(θ) ) |

≤ lim_{t→∞} (1/t) | log ∫_X exp( −L_{θ_t}^t(x,y) ) dµ_{θ_t}(x) − log ∫_X exp( −L_θ^t(x,y) ) dµ_θ(x) |

≤ ǫ + lim_{t→∞} ( ω(d_Θ(θ_t,θ)) + ((t+T)/t)( |P_{θ_t} − P_θ| + ‖φ_{θ_t} − φ_θ‖ ) ) = ǫ.
Hence, we obtain the desired result. □

5. Discussion
The motivation for this paper was to propose a general framework for proving large deviations results for inference in dynamical systems. It is of interest to explore the breadth of applications of the framework we propose. A natural class of candidate models are Bayesian inverse problems that arise in uncertainty quantification. Another class of models we do not consider in this paper arises when the dynamic model class is not characterized as a stochastic process but as an algorithm; examples of this include learning using recurrent neural networks such as long short-term memory networks.
There are technical conditions in our framework that are natural to relax, such as those on the loss function. Our framework starts with Varadhan’s Lemma, which requires the loss function to be continuous and bounded. It is not unreasonable to consider more general loss functions and work with generalizations of Varadhan’s lemma developed in large deviation theory that do not require continuity or boundedness, see [6, Theorems 1.18 & 1.20]. It is also of interest to explore our framework for discontinuous processes. Our framework is not restricted to continuous processes, since process-level large deviation results for discontinuous processes are also known [16, Theorem 5.4.27]. In that case, we let X = D(T, S) be the Skorohod space equipped with the Skorohod topology and follow the framework we have outlined.
It is tempting to consider posterior consistency over non-compact parameter spaces. Most work in posterior consistency, including ours, requires compactness of the parameter space. A basic question is whether compactness can be relaxed. In our large deviation framework, posterior consistency is obtained by requiring a process-level large deviation behavior. In [64], the author proves that the process-level LDP of a mixture of i.i.d. processes {µ_θ}_{θ∈Θ} by a fully supported prior π_0 is equivalent to the compactness of the parameter space Θ. This suggests that, from the perspective of large deviation theory, the assumption of compactness of Θ may be difficult to relax.

References
[1] Artur Avila and Jairo Bochi. On the subadditive ergodic theorem. 2009.
[2] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013.
[3] JD Biggins. Large Deviations for Mixtures. Electronic Communications in Probability, 9:60–71, 2004.
[4] Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
[5] Włodzimierz Bryc and Amir Dembo. Large deviations and strong mixing. In Annales de l'IHP Probabilités et Statistiques, volume 32, pages 549–569, 1996.
[6] Amarjit Budhiraja and Paul Dupuis. Analysis and Approximation of Rare Events: Representations and Weak Convergence Methods, volume 94. Springer, 2019.
[7] Amarjit Budhiraja, Paul Dupuis, and Vasileios Maroulas. Large deviations for infinite dimensional stochastic dynamical systems. The Annals of Probability, pages 1390–1420, 2008.
[8] Olivier Catoni. PAC-Bayesian supervised classification. Lecture Notes-Monograph Series. IMS, 2007.
[9] J-R Chazottes, E Floriani, and R Lima. Relative entropy and identification of Gibbs measures in dynamical systems. Journal of Statistical Physics, 90(3):697–725, 1998.
[10] Zhiyi Chi. Stochastic sub-additivity approach to the conditional large deviation principle. Annals of Probability, pages 1303–1328, 2001.
[11] Zhiyi Chi. Conditional large deviation principle for finite state Gibbs random fields. Preprint, 2002.

[12] T Chiyonobu and S Kusuoka. The large deviation principle for hypermixing processes. Probability Theory and Related Fields, 78(4):627–649, 1988.
[13] Nicolas Chopin, Sébastien Gadat, Benjamin Guedj, Arnaud Guyader, and Elodie Vernet. On some recent advances on high dimensional Bayesian statistics. ESAIM: Proceedings and Surveys, 51:293–319, 2015.
[14] Francis Comets, Nina Gantert, and Ofer Zeitouni. Quenched, annealed and functional large deviations for one-dimensional random walk in random environment. Probability Theory and Related Fields, 118(1):65–114, 2000.
[15] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Stochastic Modelling and Applied Probability, 38, 2010.
[16] Jean-Dominique Deuschel and Daniel W Stroock. Large deviations, volume 342. American Mathematical Soc., 2001.
[17] Persi W. Diaconis and David Freedman. Consistency of Bayes estimates for nonparametric regression: normal theory. Bernoulli, 4(4):411–444, 12 1998.
[18] IH Dinwoodie and SL Zabell. Large deviations for exchangeable random vectors. The Annals of Probability, pages 1147–1166, 1992.
[19] M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1–47, 1975.
[20] M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, III. Communications on Pure and Applied Mathematics, 29(4):389–461, July 1976.
[21] M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, IV. Communications on Pure and Applied Mathematics, 36(2):183–212, March 1983.
[22] M. D. Donsker and S. R. S. Varadhan. Large deviations for stationary Gaussian processes. Communications in Mathematical Physics, 97(1-2):187–210, 1985.
[23] J. L. Doob. Application of the theory of martingales. Colloques Internationaux du Centre National de la Recherche Scientifique, pages 23–27, 1949.
[24] Randal Douc, Eric Moulines, Jimmy Olsson, and Ramon Van Handel. Consistency of the maximum likelihood estimator for general hidden Markov models. Annals of Statistics, 39(1):474–513, 2011.
[25] Randal Douc, Jimmy Olsson, and François Roueff. Posterior consistency for partially observed Markov models. Stochastic Processes and their Applications, 130(2):733–759, 2020.
[26] Paul Dupuis and Richard S Ellis. A weak convergence approach to the theory of large deviations, volume 902. John Wiley & Sons, 2011.
[27] Paul Dupuis, Markos A Katsoulakis, Yannis Pantazis, and Petr Plecháč. Path-space information bounds for uncertainty quantification and sensitivity analysis of stochastic dynamics. SIAM/ASA Journal on Uncertainty Quantification, 4(1):80–111, 2016.
[28] Peter Eichelsbacher and Ayalvadi Ganesh. Bayesian inference for Markov chains. Journal of Applied Probability, pages 91–99, 2002.
[29] Richard S Ellis. Entropy, large deviations, and statistical mechanics. Springer, 2007.
[30] Ayalvadi Ganesh and Neil O'Connell. An inverse of Sanov's theorem. Statistics & Probability Letters, 42(2):201–206, 1999.
[31] Ayalvadi J Ganesh and Neil O'Connell. A large-deviation principle for Dirichlet posteriors. Bernoulli, 6(6):1021–1034, 2000.
[32] Elisabeth Gassiat and Judith Rousseau. About the posterior distribution in hidden Markov models with unknown number of states. Bernoulli, 20(4):2039–2075, 2014.
[33] Subhashis Ghosal and Aad Van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.

[34] Marian Grendar and George G Judge. A Bayesian large deviations probabilistic interpretation and justification of empirical likelihood. Annals of Statistics, 37(5A):2445–2457, 2009.
[35] Peter D. Grünwald and Nishant A. Mehta. Fast rates for general unbounded loss functions: From ERM to generalized Bayes. Journal of Machine Learning Research, 21(56):1–80, 2020.
[36] Shulan Hu and Liming Wu. Large deviations for random dynamical systems and applications to hidden Markov models. Stochastic Processes and their Applications, 121(1):61–90, 2011.
[37] Edwin T Jaynes. The well-posed problem. Foundations of Physics, 3(4):477–493, 1973.
[38] Yuri Kifer. Large deviations in dynamical systems and stochastic processes. Transactions of the American Mathematical Society, 321(2):505–524, 1990.
[39] Yuri Kifer. On the topological pressure for random bundle transformations. Translations of the American Mathematical Society-Series 2, 202:197–214, 2001.
[40] Yuri Kifer and Pei-Dong Liu. Random dynamics. Handbook of dynamical systems, 1:379–499, 2006.
[41] John Frank Charles Kingman. Subadditive ergodic theory. Annals of Probability, 1(6):883–899, 1973.
[42] François Ledrappier and Peter Walters. A relativised variational principle for continuous transformations. Journal of the London Mathematical Society, 2(3):568–576, 1977.
[43] Artur O Lopes, Silvia RC Lopes, and Paulo Varandas. Bayes posterior convergence for loss functions via almost additive Thermodynamic Formalism. arXiv preprint arXiv:2012.05601, 2020.
[44] David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, Dec 1999.
[45] Kevin McGoff, Sayan Mukherjee, and Andrew Nobel. Gibbs posterior convergence and the thermodynamic formalism. Annals of Applied Probability, 2021.
[46] Kevin McGoff, Sayan Mukherjee, Andrew Nobel, and Natesh Pillai. Consistency of maximum likelihood estimation for some dynamical systems. Annals of Statistics, 43(1):1–29, 2015.
[47] Kevin McGoff and Andrew B Nobel. Empirical risk minimization for dynamical systems and stationary processes. Information and Inference: A Journal of the IMA, iaaa043, 2021.
[48] Jeffrey W. Miller and David B. Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019. PMID: 31942084.
[49] Eric Olivier. Relative entropy, dimensions and large deviations for g-measures. Journal of Physics A: Mathematical and General, 33(4):675, 2000.
[50] Steven Orey and Stephan Pelikan. Large deviation principles for stationary processes. The Annals of Probability, 16(4):1481–1495, 1988.
[51] F Papangelou. Large deviations and the Bayesian estimation of higher-order Markov transition functions. Journal of Applied Probability, pages 18–27, 1996.
[52] Kalyanapuram Rangachari Parthasarathy. Probability measures on metric spaces, volume 352. American Mathematical Soc., 2005.
[53] Karl E Petersen. Ergodic Theory, volume 2. Cambridge University Press, 1989.
[54] Firas Rassoul-Agha and Timo Seppäläinen. A course on large deviations with an introduction to Gibbs measures, volume 162. American Mathematical Soc., 2015.
[55] Luc Rey-Bellet and Lai-Sang Young. Large deviations in non-uniformly hyperbolic dynamical systems. Ergodic Theory and Dynamical Systems, 28(2):587–612, 2008.
[56] Lorraine Schwartz. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 4(1):10–26, 1965.

[57] Timo Seppäläinen. Large deviations for Markov chains with random transitions. The Annals of Probability, pages 713–748, 1994.
[58] Cosma Rohilla Shalizi. Dynamics of Bayesian updating with dependent data and misspecified models. Electronic Journal of Statistics, 3:1039–1074, 2009.
[59] J Michael Steele. Kingman's subadditive ergodic theorem. In Annales de l'IHP Probabilités et statistiques, volume 25, pages 93–98, 1989.
[60] Hugo Touchette. The large deviation approach to statistical mechanics. Physics Reports, 478(1-3):1–69, 2009.
[61] Elodie Vernet. Posterior consistency for nonparametric hidden Markov models with finite state space. Electronic Journal of Statistics, 9(1):717–752, 2015.
[62] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT '90, pages 371–386, San Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc.
[63] Peter Walters. An Introduction to Ergodic Theory, volume 79. Springer Science & Business Media, 2000.
[64] L Wu. Large deviation principle for exchangeable sequences: necessary and sufficient condition. Journal of Theoretical Probability, 17(4):967–978, 2004.
[65] Lai-Sang Young. Large deviations in dynamical systems. Transactions of the American Mathematical Society, 318(2):525–543, 1990.
[66] Arnold Zellner. Optimal Information Processing and Bayes's Theorem. The American Statistician, 42(4):278–280, 1988.

Appendix A. Proofs
A.1. Variational Characterization of Posterior Consistency. To be self-contained, we reproduce the short proof of Theorem 3.1 from [45, Theorem 2].

Proof of Theorem 3.1. Let U be an open neighborhood of Θ_min. Then E = Θ \ U is closed and thus a compact subset of Θ. If π_0(E) = 0, then π_t(E | y) = 0 for any t ≥ 0 and any y ∈ Y. Otherwise, consider the conditional prior probability measure π_E = π_0(· | E), which is supported on E. Let V* = inf_{θ∈Θ} V(θ). Since E is disjoint from U, there exists ǫ > 0 such that inf_{θ∈E} V(θ) ≥ V* + ǫ. We apply (3.1) to π_0 and π_E. Then for ν-a.e. y ∈ Y, there exists T > 0 large enough such that for all t ≥ T, we have
− (1/t) log Z_t^{π_0}(y) ≤ V* + ǫ/3,
− (1/t) log Z_t^{π_E}(y) ≥ inf_{θ∈E} V(θ) − ǫ/3 ≥ V* + 2ǫ/3.
As a result, for all t ≥ T, we have
π_t(E | y) = (1/Z_t^{π_0}(y)) ∫_{E×X} exp( −L_θ^t(x,y) ) dµ_θ(x) dπ_0(θ)
= π_0(E) Z_t^{π_E}(y) / Z_t^{π_0}(y)
≤ exp( −V*t − (2ǫ/3)t + V*t + (ǫ/3)t ) ≤ exp( −(ǫ/3)t ),
which tends to zero as t → ∞. □

A.2. Exponentially Continuous Model Family. For the proof of Lemma 3.3, we adapt the argument from [6, Proposition 1.12].
Proof of Lemma 3.3. We write
V(t, θ, y) := − (1/t) log ∫_X exp( −L_θ^t(x,y) ) dµ_θ(x).
Then for ν-a.e. y ∈ Y, lim_{t→∞} V(t, θ, y) =: V(θ), and

lim_{t→∞} V(t, θ_t, y) = V(θ)
for any θ_t −→ θ, by Definition 3.5. We assume for contradiction that V is not continuous. Then there exist a sequence (θ_n)_{n∈N} and θ such that θ_n −→ θ but |V(θ_n) − V(θ)| > ǫ for some ǫ > 0. Take y ∈ Y from a set of full ν-measure such that lim_{m→∞} V(m, θ_n, y) = V(θ_n) for all n and lim_{n→∞} V(n, θ̃_n, y) = V(θ) for any θ̃_n −→ θ. Then for n ∈ N, there exists m_n ≥ n such that |V(m_n, θ_n, y) − V(θ_n)| < ǫ/2, which implies

|V(m_n, θ_n, y) − V(θ)| > ǫ/2,
contradicting the fact that lim_{n→∞} V(m_n, θ_n, y) = V(θ). □
For the proof of Proposition 3.4, we adapt the argument from [18, Theorems 2.1 & 2.2]. One may also use the approach in [64].
Proof of Proposition 3.4. We first show one side of the inequality. Let θ ∈ supp(π), which is a closed and thus compact subset of Θ, and take y ∈ Y from a set of full ν-measure such that (3.9) and (3.10) hold for θ. We claim that for all ǫ > 0 there exist an open set U_θ containing θ and T_θ ∈ [0, ∞) such that for every θ′ ∈ U_θ and t ≥ T_θ,

∫_X exp( −L_{θ′}^t(x,y) ) dµ_{θ′}(x) ≥ exp( −t(V(θ) + ǫ) ).
If not, we can find a sequence of parameters θ_k −→ θ and a subsequence n_k ∈ N such that

∫_X exp( −L_{θ_k}^{n_k}(x,y) ) dµ_{θ_k}(x) < exp( −n_k(V(θ) + ǫ) ),
which contradicts (3.10). Now since U_θ is open and θ ∈ supp(π), π(U_θ) > 0, and so for t ≥ T_θ,

Z_t^π(y) = ∫_Θ ∫_X exp( −L_{θ′}^t(x,y) ) dµ_{θ′}(x) dπ(θ′)
≥ ∫_{U_θ} ∫_X exp( −L_{θ′}^t(x,y) ) dµ_{θ′}(x) dπ(θ′)
≥ ∫_{U_θ} exp( −t(V(θ) + ǫ) ) dπ(θ′)
= π(U_θ) exp( −t(V(θ) + ǫ) ),
which gives
liminf_{t→∞} (1/t) log Z_t^π(y) ≥ −V(θ) − ǫ.
Taking ǫ from a countable sequence that goes to zero and choosing θ to be a minimizer of V on supp(π), we obtain the desired inequality.
Next, we show the other side of the inequality. Again, let θ ∈ supp(π) and take y ∈ Y from a set of full ν-measure such that (3.10) holds. Let ǫ > 0. By the same argument, there exist an open set U_θ containing θ and T_θ ∈ [0, ∞) such that for every θ′ ∈ U_θ and t ≥ T_θ,

∫_X exp( −L_{θ′}^t(x,y) ) dµ_{θ′}(x) ≤ exp( −t(V(θ) − ǫ) ).
Since supp(π) is compact, it can be covered by a finite cover {U_{θ_k}}_{1≤k≤m}. In this case, we take y ∈ Y from a set of full ν-measure such that (3.9) and

(3.10) hold for θ_k, 1 ≤ k ≤ m. Thus, for t ≥ max{ T_{θ_1}, ..., T_{θ_m} },

Z_t^π(y) = ∫_Θ ∫_X exp( −L_{θ′}^t(x,y) ) dµ_{θ′}(x) dπ(θ′)
≤ Σ_{k=1}^m ∫_{U_{θ_k}} ∫_X exp( −L_{θ′}^t(x,y) ) dµ_{θ′}(x) dπ(θ′)
≤ Σ_{k=1}^m ∫_{U_{θ_k}} exp( −t(V(θ_k) − ǫ) ) dπ(θ′)

= Σ_{k=1}^m π(U_{θ_k}) exp( −t(V(θ_k) − ǫ) ),
which implies

limsup_{t→∞} (1/t) log Z_t^π(y) ≤ max_{1≤k≤m} { −V(θ_k) } + ǫ ≤ − inf_{θ∈supp(π)} V(θ) + ǫ.
Again, taking ǫ from a countable sequence that goes to zero, we obtain the desired inequality. □

A.3. Technical Lemmas for Hypermixing Processes. For the proof of Lemma 4.5, we adapt the arguments in [16, Lemma 5.4.13].

Proof of Lemma 4.5. Let ℓ > ℓ_0 be given and r′ = ℓ + r. The goal is to show

∫_Y log ( ∫_X exp( f^{nr′}(x,y) ) dµ(x) ) dν(y) ≤ (n/α(ℓ)) ∫_Y log ( ∫_X exp( r′α(ℓ) f(x,y) ) dµ(x) ) dν(y).
Indeed, by dividing both sides by nr′ and taking n → ∞, we obtain the desired result. We first show the inequality for a Riemann sum of f^{nr′} and

then pass to the integral. Write s_j^m = jr′/m. Then by the generalized Hölder inequality for the product of multiple functions and (H-1),

∫_X exp( Σ_{k=0}^{n−1} Σ_{j=0}^{m−1} (r′/m) f ∘ (Φ × Ψ)^{s_j^m + kr′}(x,y) ) dµ(x)
= ∫_X Π_{j=0}^{m−1} exp( (r′/m) Σ_{k=0}^{n−1} f ∘ (Φ × Ψ)^{s_j^m + kr′}(x,y) ) dµ(x)
≤ Π_{j=0}^{m−1} ( ∫_X exp( r′ Σ_{k=0}^{n−1} f ∘ (Φ × Ψ)^{s_j^m + kr′}(x,y) ) dµ(x) )^{1/m}
≤ Π_{j=0}^{m−1} Π_{k=0}^{n−1} ( ∫_X exp( r′α(ℓ) f(x, Ψ^{s_j^m + kr′} y) ) dµ(x) )^{1/(mα(ℓ))}.
By taking the logarithm and integrating with respect to ν on both sides,

∫_Y log ( ∫_X exp( Σ_{k=0}^{n−1} Σ_{j=0}^{m−1} (r′/m) f ∘ (Φ × Ψ)^{s_j^m + kr′}(x,y) ) dµ(x) ) dν(y)
≤ (1/(mα(ℓ))) Σ_{j=0}^{m−1} Σ_{k=0}^{n−1} ∫_Y log ( ∫_X exp( r′α(ℓ) f(x, Ψ^{s_j^m + kr′} y) ) dµ(x) ) dν(y)
= (n/α(ℓ)) ∫_Y log ( ∫_X exp( r′α(ℓ) f(x,y) ) dµ(x) ) dν(y).
Note that the right-hand side does not depend on m. For every (x,y) ∈ X × Y,

lim_{m→∞} Σ_{j=0}^{m−1} (r′/m) f ∘ (Φ × Ψ)^{s_j^m + kr′}(x,y) = ∫_0^{r′} f ∘ (Φ × Ψ)^{s + kr′}(x,y) ds,
and clearly,

Σ_{k=0}^{n−1} ∫_0^{r′} f ∘ (Φ × Ψ)^{s + kr′}(x,y) ds = ∫_0^{nr′} f(Φ^s x, Ψ^s y) ds = f^{nr′}(x,y).
Now as m → ∞, by Lebesgue dominated convergence, we obtain

∫_Y log ( ∫_X exp( f^{nr′}(x,y) ) dµ(x) ) dν(y) ≤ (n/α(ℓ)) ∫_Y log ( ∫_X exp( r′α(ℓ) f(x,y) ) dµ(x) ) dν(y). □

Proof of Lemma 4.6. Note that by the monotone convergence of relative entropy (Theorem B.4), F_t(y) is left continuous in t, so the map t ↦ F_t(y) is determined by its values on t ∈ Q ∩ [0, ∞).
Let n ∈ N, t_1, ..., t_n ∈ [0, ∞) and s_k = (k−1)ℓ + Σ_{j=1}^{k−1} t_j. Let f_k : X → R be a bounded F_{t_k}-measurable function. Then by the variational property of relative entropy (Theorem B.2) and the hypermixing property (H-1),

−F_{(n−1)ℓ + Σ_{k=1}^n t_k}(y)
≥ ∫_X Σ_{k=1}^n f_k ∘ Φ^{s_k} dλ_y − log ∫_X exp( Σ_{k=1}^n f_k ∘ Φ^{s_k} ) dµ
≥ Σ_{k=1}^n ( ∫_X f_k ∘ Φ^{s_k} dλ_y − (1/α(ℓ)) log ∫_X exp( α(ℓ) f_k ∘ Φ^{s_k} ) dµ )
= Σ_{k=1}^n ( ∫_X f_k dλ_{Ψ^{s_k}(y)} − (1/α(ℓ)) log ∫_X exp( α(ℓ) f_k ) dµ )
= (1/α(ℓ)) Σ_{k=1}^n ( ∫_X α(ℓ) f_k dλ_{Ψ^{s_k}(y)} − log ∫_X exp( α(ℓ) f_k ) dµ ),
where we used, by Proposition B.7, that for ν-a.e. y ∈ Y, λ_y ∘ Φ^{−t} = λ_{Ψ^t y} for any t ∈ Q ∩ [0, ∞), and we used the invariance of µ. By taking the supremum over each f_k, we obtain
−F_{(n−1)ℓ + Σ_{k=1}^n t_k}(y) ≥ −(1/α(ℓ)) Σ_{k=1}^n F_{t_k}(Ψ^{s_k} y). □

For the proof of (4.14) in Lemma 4.13, we basically follow the proof of [6, Theorem 1.17].
Proof of (4.14) in Lemma 4.13. Take y ∈ Y from the set of full ν-measure such that (3.9) and (4.13) hold. For every ǫ > 0,
limsup_{t→∞} (1/t) log E[ exp( f^t(X^t, y) ) ]
= limsup_{t→∞} (1/t) log ( E[ exp( f^t(X^t,y) ) 1_{ {d_t(X^t,X,y) ≤ ǫ} } ] + E[ exp( f^t(X^t,y) ) 1_{ {d_t(X^t,X,y) > ǫ} } ] )
≤ limsup_{t→∞} (1/t) log ( E[ exp( f^t(X,y) + tǫ ) ] + exp( t‖f‖ ) P( d_t(X^t,X,y) > ǫ ) )
= max { limsup_{t→∞} (1/t) log E[ exp( f^t(X,y) + tǫ ) ], ‖f‖ + limsup_{t→∞} (1/t) log P( d_t(X^t,X,y) > ǫ ) }
= limsup_{t→∞} (1/t) log E[ exp( f^t(X,y) ) ] + ǫ = −V(θ) + ǫ.
Similarly,
liminf_{t→∞} (1/t) log E[ exp( f^t(X^t, y) ) ]
≥ liminf_{t→∞} (1/t) log E[ exp( f^t(X^t,y) ) 1_{ {d_t(X^t,X,y) ≤ ǫ} } ]
≥ liminf_{t→∞} (1/t) log E[ exp( f^t(X,y) − tǫ ) 1_{ {d_t(X^t,X,y) ≤ ǫ} } ]
≥ liminf_{t→∞} (1/t) log ( E[ exp( f^t(X,y) − tǫ ) ] − exp( t(‖f‖ − ǫ) ) P( d_t(X^t,X,y) > ǫ ) )^+
= liminf_{t→∞} (1/t) log E[ exp( f^t(X,y) ) ] − ǫ = −V(θ) − ǫ,
where a^+ := max{0, a}, and the last equality follows from the fact that if a_t, b_t ≥ 0, lim_{t→∞} (1/t) log a_t = a in R and limsup_{t→∞} (1/t) log b_t = −∞, then liminf_{t→∞} (1/t) log (a_t − b_t)^+ = a. □
Appendix B. Background and Preliminaries
B.1. Probability Measures on Polish Spaces. The following results can be found in [52] and [54, Chapter 4]. Let U be a Polish space. A duality pairing of M(U) and C_b(U) is given by ⟨η, f⟩ = ∫ f dη for η ∈ M(U) and f ∈ C_b(U). This duality pairing generates a weak topology on M(U) and hence on the closed convex subset M_1(U), with a base given by sets of the form

{ η ∈ M(U) : | ⟨η, f_i⟩ − ⟨η_0, f_i⟩ | < ǫ, i = 1, ..., m }
for ǫ > 0, m ∈ N, η_0 ∈ M(U) and f_i ∈ C_b(U), 1 ≤ i ≤ m. In particular, M_1(U) is a Polish space under this weak topology, which also characterizes the weak convergence of probability measures. In the case when U is X or X × Y, the duality pairing of M(U) and C_{b,loc}(U) with the same pairing also generates the same weak topology on M(U) and M_1(U).
Definition B.1 (Weak convergence). A sequence of probability measures η_n ∈ M_1(U) converges weakly to another probability measure η ∈ M_1(U) if ∫ f dη_n −→ ∫ f dη for every f ∈ C_b(U).
Theorem B.1 (Disintegration of measures). Let U and Y be Polish spaces. Let λ ∈ M_1(U × Y) and let ν be the Y-marginal of λ. Then there exists a Borel map Y → M_1(U), y ↦ λ_y, such that for every bounded Borel function f : U × Y → R,

∫ f dλ = ∫_Y ∫_U f(u, y) dλ_y(u) dν(y).
In particular, such a map is unique in the sense that if y ↦ λ′_y is another such map, then λ′_y = λ_y for ν-a.e. y ∈ Y. We also write λ = ∫_Y λ_y ⊗ δ_y dν(y).
B.2. Properties of Relative Entropy. The following results can be found in [6, Section 2] and [54, Exercise 6.11].
Theorem B.2 (Donsker-Varadhan variational formula). Let U be a Polish space equipped with the Borel σ-algebra, and let λ, η ∈ M_1(U). Then we have the following variational formula for the relative entropy:

K(λ ‖ η) = sup_{f∈B_b(U)} { ∫_U f dλ − log ∫_U exp(f) dη } = sup_{f∈C_b(U)} { ∫_U f dλ − log ∫_U exp(f) dη },
where B_b(U) denotes the set of bounded measurable functions on U.
Theorem B.3 (Chain rule). Let U and Y be Polish spaces, and let λ, η ∈ M_1(U × Y). Let λ_Y and η_Y be the Y-marginals of λ and η. Then we have the following chain rule for the relative entropy:

K(λ ‖ η) = K(λ_Y ‖ η_Y) + ∫ K(λ_y ‖ η_y) λ_Y(dy),
where λ_y and η_y are given by the disintegration of λ and η with respect to their Y-marginals, given by Theorem B.1.

Theorem B.4 (Monotone convergence). Let {F_t} be an increasing family of σ-algebras and let F be their limit. Let λ and η be two probability measures on F. Then the relative entropy K(λ|_{F_t} ‖ η|_{F_t}) is monotonically increasing and converges to K(λ|_F ‖ η|_F).

B.3. Ergodic Theory. We refer to [53, 63] for general introductions to ergodic theory.

B.3.1. Standard Results. The following results can be found in [16, Section 5.2]. Let Z be a Polish space equipped with the Borel σ-algebra, and let R^t : Z → Z be a one-parameter family of measurable transformations on Z.

Theorem B.5 (Birkhoff’s pointwise ergodic theorem). Let η ∈ M_1(Z, R) be an R-ergodic measure and f ∈ C_b(Z). Then there exists a measurable set E ⊂ Z with full η-measure such that for all u ∈ E,
lim_{t→∞} (1/t) ∫_0^t f(R^s u) ds = ∫_Z f dη.
In fact, E can be taken independently of f ∈ C_b(Z). See [47, Lemma 7.2].

Theorem B.6 (The Ergodic Decomposition). Let η ∈ M_1(Z, R). There exists a Borel probability measure Q on M_1(Z, R) such that Q assigns full measure to the set of R-ergodic measures in M_1(Z, R), and if f ∈ L^1(η), then f ∈ L^1(ρ) for Q-a.e. ρ ∈ M_1(Z, R), and

∫_Z f dη = ∫_{M_1(Z,R)} ∫_Z f dρ dQ(ρ).
We also write η = ∫ ρ dQ.
B.3.2. Properties of Joint Processes. The following results can be found in [40] and [47, Lemma A.2]. Let Z, Y be Polish spaces equipped with the Borel σ-algebras. Let R^t : Z → Z and Ψ^t : Y → Y be one-parameter families of measurable transformations on Z and Y. Let ν ∈ M_1(Y, Ψ) be a Ψ-ergodic measure.

Proposition B.7. Let λ ∈ J(R : ν) with its disintegration λ = ∫_Y λ_y ⊗ δ_y dν(y) over ν, given by Theorem B.1. Then for any t ≥ 0, we have for ν-a.e. y ∈ Y, λ_y ∘ (R^t)^{−1} = λ_{Ψ^t y}, and so for every f ∈ C_b(Z), we have

∫_Z f(R^t z) dλ_y(z) = ∫_Z f(z) dλ_{Ψ^t y}(z).
Theorem B.8 (Structure of the space of joint processes). J(R : ν) is a non-empty and convex set. If λ ∈ J(R : ν) and λ = ∫ ρ dQ is its ergodic decomposition, then Q-a.e. ρ ∈ J(R : ν).
Lemma B.9. Let F : J(R : ν) → [0, ∞] be an affine functional in the sense that F(aη + (1 − a)λ) = aF(η) + (1 − a)F(λ) for any a ∈ (0, 1) and η, λ ∈ J(R : ν). Then for any λ ∈ J(R : ν), we have

F(λ) = ∫_{J_e(R:ν)} F(ρ) dQ(ρ).

Proof. Since F is affine, if λ ∈ J(R : ν) and λ = ∫ ρ dQ is its ergodic decomposition, then by [16, Lemma 5.4.24] and Theorem B.8,
F(λ) = ∫_{J_e(R:ν)} F(ρ) dQ(ρ). □

Proposition B.10. Let F : J (R : ν) → R be a functional such that

F(λ) = ∫_{J_e(R:ν)} F(ρ) dQ(ρ).
Then we have
sup_{λ∈J(R:ν)} F(λ) = sup_{λ∈J_e(R:ν)} F(λ).
Proof. Straightforward. □

Appendix C. Auxiliary Results in Ergodic Theory
This section includes auxiliary subadditive ergodic results that are adapted to hypermixing processes. Let (Y, Ψ, ν) be a dynamical system such that ν is a Ψ-invariant measure. We consider a sequence of measurable functions F_t : Y → [−∞, 0] that are subadditive in the following sense: there exist ℓ_0 ≥ 0 and a non-increasing α : (ℓ_0, ∞) → [1, ∞) such that lim_{ℓ→∞} α(ℓ) = 1, and for all t_1, t_2, ..., t_n ∈ T and ℓ > ℓ_0,
(C.1) F_{(n−1)ℓ + Σ_{k=1}^n t_k}(y) ≤ (1/α(ℓ)) Σ_{k=1}^n F_{t_k}( Ψ^{(k−1)ℓ + Σ_{j=1}^{k−1} t_j} y ).
When ℓ = 0 and α(ℓ) = 1, (C.1) is exactly the subadditive property. One can relate (C.1) to decomposing the interval of length (n−1)ℓ + Σ_{k=1}^n t_k into n subintervals of length t_1, t_2, ..., t_n that are ℓ-separated from each other.
First, we consider the discrete time case T = N, and (Y, Ψ, ν) is a discrete-time dynamical system. We assume that we have already shown that the limit
lim_{k→∞} (1/k) ∫_Y F_k(y) dν(y)
exists. The goal is to show

lim_{k→∞} (1/k) ∫_Y F_k(y) dν(y) = ∫_Y liminf_{n→∞} F_n(y)/n dν(y) = ∫_Y limsup_{n→∞} F_n(y)/n dν(y),
which in particular implies that lim_{n→∞} F_n(y)/n exists for ν-a.e. y ∈ Y and

lim_{k→∞} (1/k) ∫_Y F_k(y) dν(y) = ∫_Y lim_{n→∞} F_n(y)/n dν(y).

We adapt the arguments in [1, 59] to prove the following results. The idea is that F_t can be treated as subadditive along those t that are ℓ apart from each other for ℓ > ℓ_0, and this extra factor of ℓ is negligible if we consider arbitrarily large time scales k.
Proposition C.1.

lim_{n→∞} (1/n) ∫_Y F_n(y) dν(y) ≤ ∫_Y liminf_{n→∞} F_n(y)/n dν(y).
Proof. Let F_* := liminf_{n→∞} F_n/n, which is a Ψ-invariant function. Suppose F_* ≥ −C for some constant C > 0. Fix ǫ > 0, ℓ > ℓ_0 and 1 ≤ k ≤ N ≤ n. Define the set
E_{k,N} = { y ∈ Y : F_j(y)/j ≤ F_*(y) + ǫ for some k ≤ j ≤ N }.
Fix y ∈ Y such that F_*(y) = F_*(Ψ^t y) for all t ∈ N. Note that α(ℓ) F_{(n−1)ℓ+nk} ≤ F_{(n′−1)ℓ+nk} if n′ ≤ n. In a similar way as in [1, 59], we decompose the interval of length (n′−1)ℓ + nk into classes of subintervals that are ℓ-separated from each other, where n′ ≤ n counts the total number of subintervals. One class is of the form [τ_j, τ_j + l_j) for some k ≤ l_j ≤ N and Ψ^{τ_j} y ∈ E_{k,N}, one class is of the form [σ_i, σ_i + k) where Ψ^{σ_i} y ∈ E_{k,N}^c, and a remainder interval with length at most max{N, k} = N. Let n_0 = 0, and we define inductively a sequence of integers depending on y, 0 = n_0 ≤ n_1 ≤ n_2 ≤ ···, by taking

n_j = min { n̄ ≥ n_{j−1} : Ψ^{n̄(ℓ+k)+(j−1)ℓ+N_{j−1}} y ∈ E_{k,N} },
where for convenience we write N_0 = 0 and N_j = Σ_{i=1}^j l_i, and we take l_j such that k ≤ l_j ≤ N and F_{l_j}( Ψ^{n_j(ℓ+k)+(j−1)ℓ+N_{j−1}} y ) ≤ l_j ( F_*(y) + ǫ ).
Take r to be the largest integer such that kn_r + N_r ≤ nk. Then by (C.1), the invariance of F_* on y, and F_t ≤ 0,
F_{(n−1)ℓ+nk}(y) ≤ (1/α(ℓ)) F_{(n′−1)ℓ+nk}(y)

≤ (1/α(ℓ)^2) ( Σ_{j=1}^r ( Σ_{i=n_{j−1}}^{n_j−1} F_k( Ψ^{i(ℓ+k)+(j−1)ℓ+N_{j−1}} y ) + F_{l_j}( Ψ^{n_j(ℓ+k)+(j−1)ℓ+N_{j−1}} y ) ) + non-positive terms )
≤ (1/α(ℓ)^2) Σ_{j=1}^r F_{l_j}( Ψ^{n_j(ℓ+k)+(j−1)ℓ+N_{j−1}} y )
≤ (1/α(ℓ)^2) ( F_*(y) + ǫ ) N_r
≤ (1/α(ℓ)^2) ( nkǫ + F_*(y) N_r ),
where we used the fact that N_r ≤ nk. Note that
nk − k Σ_{i=0}^{(n−1)ℓ+nk} 1_{E_{k,N}^c}(Ψ^i y) − N ≤ N_r.
By the Ψ-invariance of F_* and ν,

∫_Y F_*(y) N_r dν ≤ (nk − N) ∫_Y F_* dν − ((n−1)ℓ + nk + 1) k ∫_{E_{k,N}^c} F_* dν.
Then we have

α(ℓ)^2 lim_{n→∞} (1/n) ∫_Y F_{(n−1)ℓ+nk} dν ≤ k ∫_Y F_* dν − (k + ℓ) k ∫_{E_{k,N}^c} F_* dν + kǫ.
By taking N → ∞, we have ν(E_{k,N}^c) −→ 0, so

α(ℓ)^2 (k + ℓ) lim_{n→∞} (1/n) ∫_Y F_n dν ≤ k ∫_Y F_* dν + kǫ.
Then the result follows by dividing both sides by k, and taking k → ∞ and ℓ → ∞ successively.
For general F_*, we define F_*^{(C)} = max{−C, F_*} for C > 0. The above argument holds by replacing F_* with F_*^{(C)}. Then we have, by monotone convergence,

lim_{n→∞} (1/n) ∫_Y F_n dν ≤ lim_{C→∞} ∫_Y F_*^{(C)} dν = ∫_Y F_* dν. □

Lemma C.2. For any k ∈ N, ℓ > ℓ_0, and any y ∈ Y,

limsup_{n→∞} F_{(n−1)ℓ+nk}(y)/n ≥ α(ℓ)(k + ℓ) limsup_{n→∞} F_n(y)/n.

Proof. Fix ℓ > ℓ_0. Let n ∈ N. Then n = m_n(k+ℓ) − ℓ + r for some m_n, r ∈ N such that ℓ ≤ r < k + 2ℓ. By (C.1),
F_n(y) ≤ (1/α(ℓ)) ( F_{m_n(k+ℓ)−ℓ}(y) + F_{r−ℓ}( Ψ^{m_n(k+ℓ)}(y) ) ) ≤ (1/α(ℓ)) F_{m_n(k+ℓ)−ℓ}(y).

By taking n → ∞, we have m_n → ∞ and m_n/n → 1/(k+ℓ), so

limsup_{n→∞} F_n(y)/n ≤ (1/α(ℓ)) · (1/(k+ℓ)) limsup_{n→∞} F_{n(ℓ+k)−ℓ}(y)/n. □

Proposition C.3.

lim_{k→∞} (1/k) ∫_Y F_k(y) dν(y) ≥ ∫_Y limsup_{n→∞} F_n(y)/n dν(y).

Proof. Fix k ∈ N and ℓ > ℓ_0. Note that from (C.1), for n ∈ N,
(C.2) F_{(n−1)ℓ+nk}(y) ≤ (1/α(ℓ)) Σ_{i=0}^{n−1} F_k( Ψ^{i(k+ℓ)} y ).
Based on this, for n ∈ N, we define
G_n^k(y) = Σ_{i=0}^{n−1} F_k( Ψ^{i(k+ℓ)} y ),
which is the n-th Birkhoff sum of F_k with respect to Ψ^{k+ℓ}. Clearly, G_n^k is additive in n. By the subadditive ergodic theorem,
∫_Y limsup_{n→∞} G_n^k/n dν = lim_{n→∞} ∫_Y G_n^k/n dν = ∫_Y F_k dν.
Then by (C.2) and Lemma C.2,

limsup_{n→∞} G_n^k/n = limsup_{n→∞} (1/n) Σ_{i=0}^{n−1} F_k ∘ Ψ^{i(k+ℓ)}
≥ α(ℓ) limsup_{n→∞} F_{(n−1)ℓ+nk}/n
≥ α(ℓ)^2 (k + ℓ) limsup_{n→∞} F_n/n.
By integrating on both sides, we obtain
(1/(k+ℓ)) ∫_Y F_k dν ≥ α(ℓ)^2 ∫_Y limsup_{n→∞} F_n/n dν.
The result follows by taking the limit as k → ∞ on the left-hand side and then taking ℓ → ∞ on the right-hand side. □
Now we want to extend the subadditive ergodic result to the continuous-time case. Consider now T = [0, ∞), and let (Y, Ψ, ν) be a continuous-time dynamical system. The argument is inspired by [41, Theorem 4].
Proposition C.4. For y ∈ Y, we have the following connection between convergence in discrete time and continuous time:
liminf_{n→∞} F_n(y)/n ≤ liminf_{t→∞} F_t(y)/t ≤ limsup_{t→∞} F_t(y)/t ≤ limsup_{n→∞} F_n(y)/n,

In particular, lim_{t→∞} F_t(y)/t = lim_{n→∞} F_n(y)/n exists for ν-a.e. y ∈ Y, and
lim_{t→∞} ∫_Y F_t/t dν = ∫_Y lim_{t→∞} F_t(y)/t dν.
Proof. Fix ℓ ∈ N with ℓ > ℓ_0. Let t be large enough so that we do not get negative times below, and let n ∈ N be the integer part of t. By (C.1), we have
F_t(y) ≤ (1/α(ℓ)) ( F_{n−ℓ}(y) + F_{t−n}(Ψ^n y) ),
F_{n+ℓ+1}(y) ≤ (1/α(ℓ)) ( F_t(y) + F_{n+1−t}(Ψ^{t+ℓ} y) ),
which implies, using the assumption that F_t ≤ 0,
α(ℓ) F_{n+ℓ+1}(y) ≤ F_t(y) ≤ (1/α(ℓ)) F_{n−ℓ}(y).
Now it is clear that the result follows. □

Appendix D. Auxiliary Large Deviation Results
D.1. A Weak Large Deviation Upper Bound.

Proof of Lemma 4.9. We basically reproduce the argument of [54, Theorem 4.24] and only need to make sure the result holds for a subset of Y with full ν-measure.
Let K ⊂ M_1(X × Y) be a compact subset. Fix an ǫ > 0 and c < inf_K p*. For every λ ∈ K, there exists f ∈ C_{b,loc}(X × Y) such that ∫ f dλ − p(f) > c. Consider the following open neighborhood of λ:

B_ǫ(λ, f) = { η ∈ M_1(X × Y) : | ∫ f dλ − ∫ f dη | < ǫ }.

Since K is compact, it can be covered with B_ǫ(λ_1, f_1), ..., B_ǫ(λ_m, f_m) with corresponding f_1, ..., f_m ∈ C_{b,loc}(X × Y). Then for every y ∈ Y,

µ( M_t(·, y) ∈ B_ǫ(λ_i, f_i) )

= ∫_{ {M_t(·,y) ∈ B_ǫ(λ_i,f_i)} } exp( −f_i^t(x,y) + f_i^t(x,y) ) dµ(x)
≤ exp( t( −∫ f_i dλ_i + ǫ ) ) ∫_{ {M_t(·,y) ∈ B_ǫ(λ_i,f_i)} } exp( f_i^t(x,y) ) dµ(x).
By Theorem 4.2, for ν-a.e. y ∈ Y independent of f_i, hence independent of K,
limsup_{t→∞} (1/t) log µ( M_t(·,y) ∈ B_ǫ(λ_i,f_i) ) ≤ −∫ f_i dλ_i + ǫ + p(f_i) ≤ −c + ǫ.
Hence, for ν-a.e. y ∈ Y independent of K,
limsup_{t→∞} (1/t) log µ( M_t(·,y) ∈ K ) ≤ max_{1≤i≤m} limsup_{t→∞} (1/t) log µ( M_t(·,y) ∈ B_ǫ(λ_i,f_i) ) ≤ −c + ǫ.
We take ǫ −→ 0 and c −→ inf_K p* to yield the desired result.
That p*(λ) = ∞ if λ is not (Φ × Ψ)-invariant follows from the same proof as in [16, p. 218]. The last statement is also straightforward: if the Y-marginal of λ is not ν, then for any N > 0, there exists f ∈ C_b(Y) such that ∫ f dλ − ∫ f dν ≥ N, so by Birkhoff’s pointwise ergodic theorem, p*(λ) ≥ ∫ f dλ − p(f) ≥ N. □

D.2. The Exponential Tightness. To obtain the full large deviation upper bound, we need to show that for X ∼ µ and ν-a.e. y ∈ Y, (M_t(X,y))_{t∈T} is exponentially tight, i.e., for all a > 0, there exists a compact subset K ⊂ M_1(X × Y) such that

limsup_{t→∞} (1/t) log µ( M_t(X,y) ∈ K^c ) ≤ −a.
The following generic lemma allows us to push exponential tightness from M_1(X) to M_1(X × Y).

Lemma D.1. Suppose X ∼ µ, (Mt(X))t∈T is an exponentially tight family of M1(X )-valued random variables. Then for ν-a.e. y ∈ Y, (Mt(X,y))t∈T is an exponentially tight family of M1(X ×Y)-valued random variables.

Proof. Note that by Theorem B.5, for ν-a.e. y ∈ Y, M_t(y) −→ ν, so {M_t(y)}_t has compact closure. Let y ∈ Y be such a point and let K_Y be the closure of {M_t(y)}_t, which is a compact subset of M_1(Y). Fix a > 0. By exponential tightness of (M_t(X))_t, there exists a compact subset K_X ⊂ M_1(X) such that
limsup_{t→∞} (1/t) log µ( M_t(X) ∈ K_X^c ) ≤ −a.
Consider the set K = { λ ∈ M_1(X × Y) : λ_X ∈ K_X, λ_Y ∈ K_Y }, where λ_X denotes the X-marginal of λ, and similarly for λ_Y. It is easy to check that K ⊂ M_1(X × Y) is closed and tight, hence compact by Prokhorov’s Theorem. Then we obtain
limsup_{t→∞} (1/t) log µ( M_t(X,y) ∈ K^c )
≤ limsup_{t→∞} (1/t) log ( µ( M_t(X) ∈ K_X^c ) + δ_y( M_t(y) ∈ K_Y^c ) )
= limsup_{t→∞} (1/t) log µ( M_t(X) ∈ K_X^c ) ≤ −a. □

Department of Mathematics, Duke University, Durham, NC 27708 Email address: [email protected]

Departments of Statistical Science, Mathematics, Computer Science, Bio- statistics & Bioinformatics, Duke University, Durham, NC 27708 Email address: [email protected]