Hidden Gibbs Models: Theory and Applications

– DRAFT –

Evgeny Verbitskiy

Mathematical Institute, Leiden University, The Netherlands

Johann Bernoulli Institute, Groningen University, The Netherlands

November 21, 2015

Contents

1 Introduction
  1.1 Hidden Models / Hidden Processes
    1.1.1 Equivalence of "random" and "deterministic" settings
    1.1.2 Deterministic chaos
  1.2 Hidden Gibbs Models
    1.2.1 How realistic is the Gibbs assumption?

I Theory

2 Gibbs states
  2.1 Gibbs-Boltzmann Ansatz
  2.2 Gibbs states for Lattice Systems
    2.2.1 Translation Invariant Gibbs states
    2.2.2 Regularity of Gibbs states
  2.3 Gibbs and other regular stochastic processes
    2.3.1 Markov Chains
    2.3.2 k-step Markov chains
    2.3.3 Variable Length Markov Chains
    2.3.4 Countable Mixtures of Markov Chains (CMMC's)
    2.3.5 g-measures (chains with complete connections)
  2.4 Gibbs measures in Dynamical Systems
    2.4.1 Equilibrium states

3 Functions of Markov processes
  3.1 Markov Chains (recap)
    3.1.1 Regular measures on Subshifts of Finite Type
  3.2 Functions of Markov Processes
    3.2.1 Functions of Markov Chains in Dynamical systems
  3.3 Hidden Markov Models
  3.4 Functions of Markov Chains from the Thermodynamic point of view

4 Hidden Gibbs Processes
  4.1 Functions of Gibbs Processes
  4.2 Yayama
  4.3 Review: renormalization of g-measures
  4.4 Cluster Expansions
  4.5 Cluster expansion

5 Hidden Gibbs Fields
  5.1 Renormalization Transformations in ..
  5.2 Examples of pathologies under renormalization
  5.3 General Properties of Renormalised Gibbs Fields
  5.4 Dobrushin's reconstruction program
    5.4.1 Generalized Gibbs states / "Classification" of singularities
    5.4.2 Variational Principles
  5.5 Criteria for Preservation of Gibbs property under renormalization

II Applications

6 Hidden Markov Models
  6.1 Three basic problems for HMM's
    6.1.1 The Evaluation Problem
    6.1.2 The Decoding or State Estimation Problem
  6.2 Gibbs property of Hidden Markov Chains

7 Denoising

A Prerequisites
  A.1 Notation
  A.2 theory
  A.3 Stochastic processes
  A.4
  A.5
    A.5.1 Shannon's entropy rate per symbol
    A.5.2 Kolmogorov-Sinai entropy of measure-preserving systems

Chapter 1

Introduction

Last modified on April 3, 2015

Very often the true dynamics of a physical system is hidden from us: we are able to observe only a certain measurement of the state of the underlying system. For example, air temperature and wind speed are easily observable functions of the climate dynamics, while an electroencephalogram provides insight into the functioning of the brain.

[Diagram: Partial Observability / Noise / Coarse Graining]

Observing only partial information naturally limits our ability to describe or model the underlying system, detect changes in dynamics, make predictions, etc. Nevertheless, these problems must be addressed in practical situations, and they are extremely challenging from the mathematical point of view. Mathematics has proved to be extremely useful in dealing with imperfect data. There are a number of methods, developed within various mathematical disciplines, such as

• Dynamical Systems

• Control Theory

• Decision Theory

• Information Theory

to address particular instances of this general problem of dealing with imperfect knowledge.

Remarkably, many of these problems have a common nature. For example, correcting signals corrupted by noisy channels during transmission, analyzing neuronal spikes, analyzing genomic sequences, and validating the so-called renormalization group methods of theoretical physics all have the same underlying mathematical structure: a hidden Gibbs model. In this model, the observable process is a function (either deterministic or random) of a process whose underlying probabilistic law is Gibbs. Gibbs models were introduced in Statistical Mechanics, and have been highly successful in modeling physical systems (ferromagnets, dilute gases, polymer chains). Nowadays Gibbs models are ubiquitous and are used by researchers from different fields, often implicitly, i.e., without the knowledge that a given model belongs to a much wider class. The well-established Gibbs theory provides an excellent starting point for developing a Hidden Gibbs theory. A subclass of Hidden Gibbs models is formed by the well-known Hidden Markov Models. Several examples have been considered in Statistical Mechanics and the theory of Dynamical Systems. However, many basic questions are still unanswered. Moreover, the relevance of Hidden Gibbs models to other areas, such as Information Theory, Bioinformatics, and Neuronal Dynamics, has never been made explicit and exploited.

Let us now formalise the problem, first by describing the mathematical paradigm used to model partial observability, effects of noise, and coarse-graining, and then by introducing the class of Hidden Gibbs Models.

1.1 Hidden Models / Hidden Processes

Let us start with the following basic model, which we will specify further in the subsequent chapters.

The observable process {Yt} is a function of the hidden time-dependent process {Xt}.

We allow the process {Xt} to be either

• a stationary random (stochastic) process,

• a deterministic dynamical process

Xt+1 = f (Xt).

For simplicity, we will only consider processes with discrete time, i.e., t ∈ Z or Z+. However, much of what we discuss applies to continuous-time processes as well.

The process {Yt} is a function of {Xt}. The function can be

• deterministic Yt = f (Xt) for all t,

• or random: Yt ∼ P_{X_t}(·), i.e., Yt is chosen according to some probability distribution which depends on Xt.

Remark 1.1. If not stated otherwise, we will implicitly assume that in case Yt is a random function of the underlying hidden process Xt, then Yt is chosen independently for every t. For example,

Yt = Xt + Zt,

where the "noise" {Zt} is a sequence of independent identically distributed random variables. In many practical situations discussed below one can easily allow for "dependence", e.g., {Zt} can be a Markov process. Moreover, as we will see later, the case of a random function can be reduced to the deterministic case, and hence any {Zt} can be used to model noise. However, despite the fact that the models are equivalent (see the next subsection), it is often convenient to keep the dependence on the noise parameters implicit.
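The additive-noise model above can be sketched in a few lines. This is a minimal illustration, not from the text: the two-state hidden chain, its flip probability, and the noise level sigma are all made-up parameters.

```python
import random

random.seed(0)

# Hidden process: a two-state stationary Markov chain on {-1, +1}
# (illustrative parameters, not from the text).
def sample_hidden(n, p_flip=0.1):
    x = [random.choice([-1, 1])]
    for _ in range(n - 1):
        x.append(-x[-1] if random.random() < p_flip else x[-1])
    return x

# Observable: Y_t = X_t + Z_t with i.i.d. Gaussian noise Z_t.
def observe(xs, sigma=0.3):
    return [x + random.gauss(0.0, sigma) for x in xs]

xs = sample_hidden(1000)
ys = observe(xs)

# With moderate noise, sign(Y_t) recovers X_t most of the time.
agreement = sum(1 for x, y in zip(xs, ys) if (y > 0) == (x > 0)) / len(xs)
```

With small sigma the hidden state is read off almost perfectly from the observation; as sigma grows, the partial-observability problem becomes genuinely hard.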

1.1.1 Equivalence of “random” and “deterministic” settings.

There is a one-to-one correspondence between stationary processes and measure-preserving dynamical systems.

Proposition 1.2. Let (Ω, A, µ, T) be a measure-preserving dynamical system. Then for any measurable ϕ : Ω → R,

Y_t(ω) = ϕ(T^t ω),  ω ∈ Ω,  t ∈ Z_+ or Z,   (1.1.1)

is a stationary process. In the opposite direction, any real-valued stationary process {Yt} gives rise to a measure-preserving dynamical system (Ω, A, µ, T) and a measurable function ϕ : Ω → R such that (1.1.1) holds.

 Exercise 1.3. Prove Proposition 1.2.

1.1.2 Deterministic chaos

Chaotic systems are typically defined as systems where trajectories depend sensitively on the initial conditions, that is, small differences in initial conditions result, over time, in substantial differences in the states: small causes can produce large effects. Simple observables (functions) {Yt} of trajectories of chaotic dynamical systems {Xt}, Xt+1 = f(Xt), can be indistinguishable from "random" processes.

Example 1.4. Consider a piece-wise expanding map f : [0, 1] → [0, 1],

f (x) = 2x mod 1,

and the following observable φ:

φ(x) = 0, if x ∈ [0, 1/2] =: I_0,
φ(x) = 1, if x ∈ (1/2, 1] =: I_1.

Since the map f is expanding: for x, x′ close,

d(f(x), f(x′)) = 2 d(x, x′),

the dynamical system is chaotic. Furthermore, one can easily show that the Lebesgue measure on [0, 1] is f-invariant, and the process Y_n = φ(f^n(x)) is actually a sequence of independent fair coin flips, i.e., a Bernoulli process.
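The doubling map acts as a shift on binary expansions, so the example can be checked exactly with integer arithmetic (naive floating-point iteration would collapse to 0 after about 53 steps, since doubling shifts mantissa bits out). A minimal sketch, with the initial point drawn uniformly from a fine dyadic grid as a stand-in for a Lebesgue-typical point:

```python
import random

random.seed(1)
N = 10000
X = random.getrandbits(N)   # x = X / 2^N, uniform on a fine dyadic grid in [0, 1)

# Exact doubling-map dynamics on dyadic rationals: f(x) = 2x mod 1 shifts the
# binary expansion left, so phi(f^n(x)) is the (n+1)-th binary digit of x.
num, den = X, 1 << N
orbit = []
for _ in range(N):
    bit = 1 if 2 * num >= den else 0   # phi(x): is x >= 1/2? (boundary points form a null set)
    orbit.append(bit)
    num = (2 * num) % den              # f(x) = 2x mod 1, computed exactly

digits = [(X >> (N - 1 - i)) & 1 for i in range(N)]   # binary digits of x
freq_ones = sum(orbit) / N
```

The orbit of the observable coincides with the digit sequence of x, and for a random x the digits are i.i.d. fair coin flips, so `freq_ones` is close to 1/2.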

Theorem 1.5 (Sinai's Theorem on Bernoulli factors).

At the same time, a time series often looks "completely random" while having a very simple underlying dynamical structure. In the figure below, the Gaussian white noise is compared visually with the so-called deterministic Gaussian white noise. The reconstruction techniques (cf. Chapter ??) easily allow one to distinguish the time series, as well as to identify the underlying dynamics.

Figure 1.1: (a) Time series of the Gaussian and deterministic Gaussian white noise. (b) Corresponding probability densities of single observations. (c) Reconstruction plots Xn+1 vs Xn.

Many vivid examples of close resemblance between "random" and "deterministic" processes can be found in the literature. For example, the following figure, taken from Daniel T. Kaplan and Leon Glass, Physica D 64, 431-453 (1993), demonstrates visual similarity between several natural and surrogate models.

1.2 Hidden Gibbs Models

The key assumption of the model is that the observable process is a function, either deterministic or random, of the underlying process, whose governing probability distribution belongs to the class of Gibbs distributions.

(i) The most famous example covered by the proposed dynamic probabilistic models is the class of so-called Hidden Markov Models (HMM). Suppose {Xn} is a Markov process taking values in a finite state space A. In HMM's, the observable output process Yn, with values in B, depends only on Xn, and the value Yn ∈ B is chosen according to the so-called emission distributions:

Π_{ij} = P(Yn = j | Xn = i),  i ∈ A, j ∈ B,   (1.2.1)

for each n independently. Hidden Markov models can be found in a broad range of applications: from speech and handwriting recognition to bioinformatics and signal processing.
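Sampling from an HMM follows (1.2.1) directly: advance the hidden chain, and at each time emit a symbol from the row of the emission matrix selected by the current hidden state. The concrete transition and emission matrices below are invented for the sketch.

```python
import random

random.seed(2)

# Illustrative HMM with states A = {0, 1} and outputs B = {0, 1}.
P  = [[0.9, 0.1],
      [0.2, 0.8]]        # P[i][j] = P(X_{n+1} = j | X_n = i)
Pi = [[0.95, 0.05],
      [0.10, 0.90]]      # Pi[i][j] = P(Y_n = j | X_n = i), the emission matrix (1.2.1)

def step(dist):
    # Draw an index from a finite distribution.
    u, acc = random.random(), 0.0
    for j, p in enumerate(dist):
        acc += p
        if u < acc:
            return j
    return len(dist) - 1

def sample_hmm(n, x0=0):
    xs, ys, x = [], [], x0
    for _ in range(n):
        xs.append(x)
        ys.append(step(Pi[x]))   # emit Y_n given X_n, independently for each n
        x = step(P[x])           # advance the hidden chain
    return xs, ys

xs, ys = sample_hmm(5000)
match = sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)
```

With near-diagonal emissions the output usually reveals the hidden state, yet {Yn} itself is in general no longer Markov, which is the starting point of the theory developed later.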

(ii) A Binary Symmetric Channel (BSC) is a widely used model of a communication channel in coding theory and information theory. In this model, a transmitter sends a bit (a zero or a one), and the receiver receives a bit. Typically, the bit is transmitted correctly, but with a small probability e the bit will be inverted or flipped.

[Channel diagram: 0 → 0 and 1 → 1 with probability 1 − e; crossovers with probability e.]

We can represent the action of the channel as

Yn = Xn ⊕ Zn,

where ⊕ is the exclusive OR operation, and Zn is the state of the channel: 0 means transmit "as is", 1 means flip. The channel can be memoryless: in this case, {Zn} is a sequence of independent Bernoulli random variables with P(Zn = 1) = e; or the channel can have memory, e.g., when {Zn} is Markov (the Gilbert-Elliott channel, suitable for modelling burst errors). If {Xn} is a binary Markov process, then this model is a particular example of a hidden Markov model.
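Both channel variants are easy to simulate. The Gilbert-Elliott parameters below (good/bad switching rates and per-state error probabilities) are illustrative, not from the text.

```python
import random

random.seed(3)
e = 0.05   # crossover probability

def bsc_memoryless(bits):
    # Z_n i.i.d. Bernoulli(e); Y_n = X_n XOR Z_n.
    return [x ^ (1 if random.random() < e else 0) for x in bits]

def bsc_gilbert_elliott(bits, p_gb=0.02, p_bg=0.2, e_good=0.01, e_bad=0.3):
    # The channel state {good, bad} is itself a Markov chain, producing
    # bursty errors (all parameter values here are illustrative).
    out, bad = [], False
    for x in bits:
        err = e_bad if bad else e_good
        out.append(x ^ (1 if random.random() < err else 0))
        bad = (random.random() < p_gb) if not bad else (random.random() >= p_bg)
    return out

xs = [random.randint(0, 1) for _ in range(20000)]
ys = bsc_memoryless(xs)
ys_burst = bsc_gilbert_elliott(xs)
err_rate = sum(x != y for x, y in zip(xs, ys)) / len(xs)
```

For the memoryless channel the empirical error rate concentrates around e; in the Gilbert-Elliott case the same overall rate can be achieved with errors clumped into bursts.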

(iii) Neuronal spikes are sharp pronounced deviations from the baseline in recordings of neuronal activity. Often, the electrical neuronal activity is transformed into 0/1 processes, indicating the presence or absence of spikes. These temporal binary processes, called spike trains, are clearly functions of the underlying neuronal dynamics. In the field of Neural Coding/Decoding, the relationship between the stimulus and the resulting spike responses is investigated. Another, particularly interesting, type of spike process originates from the so-called interictal (between the seizures) spikes. The presence of interictal spikes is used diagnostically as a sign of epilepsy.

Figure 1.2: (a) The spike trains emitted by 30 neurons in the visual area of a monkey brain are shown for a period of 4 seconds. The spike train emitted by each of the neurons is recorded in a separate row. (source W.Maas, http://www.igi.tugraz.at/maass/). (b) EEG recordings of a pa- tient with Rolandic epilepsy provided by the Epilepsy Centre Kempen- haeghe. Epileptic interictal spikes are clearly visible in the last 3 epochs.

(iv) Decimation is a classical example of a renormalization transformation in Statistical Mechanics. Consider a d-dimensional lattice system Ω = A^{Z^d}, with single-spin space A. The image of a spin configuration x ∈ Ω under the decimation transformation with spacing b is again a spin configuration y in Ω, y = T_b(x), given by

y_n = x_{bn} for all n ∈ Z^d.

Under the decimation transformation, only every b^d-th spin survives. The figure on the right depicts the action of the decimation transformation T_2 on the two-dimensional integer lattice Z^2.
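On a finite patch of the lattice, decimation is just sub-sampling of the configuration array. A minimal sketch for d = 2 (the test configuration is chosen arbitrarily):

```python
# Decimation with spacing b on a finite patch of Z^2:
# y_n = x_{bn}, so only every b^d-th spin (here d = 2) survives.
def decimate(config, b):
    return [row[::b] for row in config[::b]]

# A 6x6 configuration of +1/-1 spins (values chosen arbitrarily).
x = [[(-1) ** ((i + j) // 2) for j in range(6)] for i in range(6)]
y = decimate(x, 2)   # 3x3 image configuration: y[i][j] = x[2i][2j]
```

The interesting question, taken up in Chapter 5, is not the map itself but what happens to the Gibbsian structure of the distribution of x under this map.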

1.2.1 How realistic is the Gibbs assumption?

ADVANTAGES OF GIBBS ASSUMPTION

Part I

Theory


Chapter 2

Gibbs states

Last modified on September 23, 2015

The existing literature on the rigorous mathematical theory of Gibbs states is quite extensive. The classical references are the books of Ruelle [18] and Georgii [7, 8]. In the context of dynamical systems, the books of Bowen [11, 12] provide a nice introduction to the subject.

2.1 Gibbs-Boltzmann Ansatz

The Gibbs distribution prescribes the probability of the event that a system with finite state space S is in state x ∈ S to be equal to

P(x) = (1/Z) exp( −H(x)/(kT) ),   (2.1.1)

where H(x) is the energy of the state x, k is the Boltzmann constant, T is the temperature of the system, and

Z = ∑_{y ∈ S} exp( −H(y)/(kT) )

is a normalising factor, known as the partition function.

Exercise. Show that the probability measure P on S given by (2.1.1) is precisely the unique probability measure on S which maximizes the following functional, defined on probability measures on S by

P ↦ − ∑_{x ∈ S} P(x) log P(x) − β ∑_{x ∈ S} H(x) P(x) = Entropy − β · Energy,

where β = 1/kT is the inverse temperature.


2.2 Gibbs states for Lattice Systems

Equation (2.1.1), known as the Gibbs-Boltzmann Ansatz, makes perfect sense for systems where the number of possible states is finite. Typically, for systems with infinitely many states, each state occurs with probability zero, and a new interpretation of (2.1.1) is required. This problem was successfully solved in the late 1960's by R.L. Dobrushin [2], and O. Lanford and D. Ruelle [3]. The key is to identify Gibbs states as those probability measures on the infinite state space which have very specific finite-volume conditional probabilities, parametrized by boundary conditions, and given by expressions similar to (2.1.1).

The definition of Gibbs measures is applicable to rather general lattice systems Ω = A^L, where A is a finite or compact spin space, and L is a lattice. For simplicity, let us assume that A is finite, and L is the integer lattice Z^d, d ≥ 1. Elements ω of Ω are functions on Z^d with values in A; equivalently, at each site n in Z^d "sits" a spin ω_n with a value in A. If ω ∈ Ω, and Λ is a subset of Z^d, then ω_Λ denotes the restriction of ω to Λ, i.e., ω_Λ ∈ A^Λ. Fix a finite volume Λ in Z^d, and some configuration ω_{Λ^c} outside of Λ; we will refer to ω_{Λ^c} as the boundary conditions. Gibbs measures are defined by prescribing the conditional probability of observing configuration σ_Λ in the finite volume Λ, given the boundary conditions ω_{Λ^c}:

P(σ_Λ | ω_{Λ^c}) = (1/Z(ω_{Λ^c})) exp( −β H_Λ(σ_Λ ω_{Λ^c}) ) =: γ_Λ(σ_Λ | ω_{Λ^c}),   (2.2.1)

where β = 1/kT is the inverse temperature, H_Λ is the corresponding energy, and the normalising factor, the partition function Z, now depends on the boundary conditions ω_{Λ^c}. The analogy with (2.1.1) is clear.

Naturally, the family of conditional probabilities Γ = {γΛ(·|σΛc )}, called specification, indexed by finite sets Λ and boundary conditions σΛc , can- not be arbitrary, and must satisfy some consistency conditions. In order to ensure the consistency of the family of probability kernels {γΛ}, the energies HΛ must be of a very special form, namely:

H_Λ(σ_Λ ω_{Λ^c}) = ∑_{V ∩ Λ ≠ ∅} Ψ(V, σ_Λ ω_{Λ^c}),   (2.2.2)

where the sum is taken over all finite subsets V of Z^d, and the collection of functions

Ψ = {Ψ(V, ·) : Ω → R : V ⊂ Z^d, |V| < ∞},

called the interaction, has the following properties:

• (locality): for all η ∈ Ω, Ψ(V, η) = Ψ(V, η_V), meaning that the contribution of volume V to the total energy only depends on the spins in V.

• (uniform absolute convergence):

|||Ψ||| = sup_{n ∈ Z^d} ∑_{V ∋ n} sup_{η ∈ Ω} |Ψ(V, η_V)| < +∞.

Definition. A probability measure P on Ω = A^{Z^d} is called a Gibbs measure for the interaction Ψ = {Ψ(Λ, ·) | Λ ⋐ Z^d} (denoted by P ∈ G(Ψ)) if (2.2.1) holds P-a.s.; equivalently, if the so-called Dobrushin-Lanford-Ruelle equations hold for all f ∈ L^1(Ω, P) (f ∈ C(Ω, R)) and all Λ ⋐ Z^d:

∫_Ω f(ω) P(dω) = ∫_Ω (γ_Λ f)(ω) P(dω),

where

(γ_Λ f)(ω) = ∑_{σ_Λ ∈ A^Λ} γ_Λ(σ_Λ | ω_{Λ^c}) f(σ_Λ ω_{Λ^c}).

This is equivalent to saying that

EP( f |FΛc ) = γΛ f ,

where F_{Λ^c} is the σ-algebra generated by the spins in Λ^c.

Theorem 2.1. If Ψ is a uniformly absolutely convergent interaction, then the set of Gibbs measures for Ψ is not empty.

Thus the Dobrushin-Lanford-Ruelle (DLR) theory guarantees the existence of at least one Gibbs state for any interaction Ψ as above, i.e., a Gibbs state consistent with the corresponding specification. It is possible that multiple Gibbs states are consistent with the same specification; in this case, one speaks of a phase transition.

2.2.1 Translation Invariant Gibbs states

The lattice (group) Z^d acts on A^{Z^d} by translations:

Z^d ∋ k → S^k,

where S^k(ω) = ω̃ with

ω̃_n = ω_{n+k} for all n ∈ Z^d,

i.e., the configuration ω is shifted by k. Recall that a measure P on Ω = A^{Z^d} is called translation invariant if

P(S^{−k} A) = P(A)

for every Borel set A. When are Gibbs states translation invariant? What are sufficient conditions for existence of translation invariant Gibbs states?

Definition 2.2. An interaction Φ = {φ(Λ, ·) | Λ ⋐ Z^d} is called translation invariant if

φ_{Λ+k}(ω) = φ_Λ(S^k ω)   (φ_{Λ+k} = φ_Λ ∘ S^k)

for all Λ ⋐ Z^d, k ∈ Z^d, and every ω ∈ Ω = A^{Z^d}.

Example 2.3 (Ising model of interacting spins). The alphabet A = {−1, 1} represents the two possible spin values "up" and "down", and the interaction is defined as follows:

φ_Λ(ω) = −h ω_i, if Λ = {i},
φ_Λ(ω) = −J ω_i ω_j, if Λ = {i, j} and i ∼ j,
φ_Λ(ω) = 0, otherwise.

The notation i ∼ j indicates that the sites i and j are nearest neighbours, i.e., ||i − j||_1 = 1. The parameter h is the strength of the external magnetic field, and the coupling constant J determines whether the interaction is ferromagnetic (J > 0) or anti-ferromagnetic (J < 0). In the ferromagnetic case, spins "prefer" to be aligned, i.e., aligned configurations have higher probabilities. The spins also prefer to line up with the external field.
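For the Ising interaction, the single-site kernel γ_{i} of (2.2.1) can be written in closed form: with Λ = {i}, the energy terms touching i are −h σ_i − J σ_i (sum of neighbouring spins), so the conditional probability of spin up is a logistic function of the local field. A minimal sketch; the default values of β, J, h are illustrative.

```python
import math

# Heat-bath probability gamma_{i}(sigma_i = +1 | neighbours) for the Ising
# interaction: P(+1)/P(-1) = exp(2 * beta * (h + J * sum of neighbours)).
def cond_prob_up(neighbour_spins, J=1.0, h=0.0, beta=0.5):
    local_field = h + J * sum(neighbour_spins)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * local_field))

p_aligned = cond_prob_up([+1, +1, +1, +1])   # all four Z^2 neighbours up
p_mixed   = cond_prob_up([+1, +1, -1, -1])   # zero net local field
```

With all neighbours up the spin strongly prefers +1, while a balanced neighbourhood (and h = 0) gives probability exactly 1/2, a small-scale illustration of the "alignment" preference described above.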

If φ is translation invariant, then so is the specification

γ_Λ(S^k σ | S^k ω) = γ_{Λ+k}(σ | ω).

Theorem 2.4. If γ = {γ_Λ(·|·)} is a translation invariant specification, then the set of translation invariant Gibbs measures for γ is not empty.

2.2.2 Regularity of Gibbs states

If the interaction Ψ is as above, then the corresponding specification Γ = {γΛ(·|·)} has the following important regularity properties:

(i) Γ is uniformly non-null (finite energy): the probability kernels γ_Λ(·|·) are uniformly bounded away from 0 and 1, i.e., for every finite set Λ there exist positive constants α_Λ, β_Λ such that

0 < α_Λ ≤ γ_Λ(σ_Λ | ω_{Λ^c}) ≤ β_Λ < 1

for all σ_Λ ω_{Λ^c} ∈ Ω.

(ii) Γ is quasi-local: for any Λ, the kernels are continuous in the product topology, i.e., as V ↗ Z^d,

sup_{ω, η, ζ, χ} | γ_Λ(ω_Λ | η_{V∖Λ} ζ_{V^c}) − γ_Λ(ω_Λ | η_{V∖Λ} χ_{V^c}) | → 0;

in other words, changing remote spins (spins in V^c) has a diminishing effect on the distribution of the spins in Λ.

Consider now the inverse problem: suppose P is a probability measure on Ω. Can we decide whether it is Gibbs, equivalently, whether it is consistent with a Gibbs specification Γ = Γ_Ψ for some Ψ? Naturally, for a measure to be Gibbs, it must at least be consistent with some uniformly non-null quasi-local specification Γ = {γ_Λ(·|·)}_Λ:

P(σ_Λ | ω_{Λ^c}) = γ_Λ(σ_Λ | ω_{Λ^c})   (2.2.3)

for P-a.a. ω. The necessary and sufficient conditions for (2.2.3) to hold are given by the following theorem.

Theorem 2.5. A measure P on Ω = A^{Z^d} with full support is consistent with some uniformly non-null quasi-local specification if and only if for every Λ ⋐ Z^d and all σ_Λ, the conditional probabilities P(σ_Λ | ω_{V∖Λ}) converge uniformly in ω as V ↗ Z^d.

Suppose now that the probability measure P has passed the test of Theorem 2.5, and is thus consistent with some uniformly non-null quasi-local specification Γ = {γ_Λ(·|·)}_Λ. Is it possible to find an interaction Ψ such that Γ = Γ_Ψ, i.e.,

γ(σ_Λ | ω_{Λ^c}) = (1/Z(ω_{Λ^c})) exp( −H^Ψ_Λ(σ_Λ ω_{Λ^c}) ) =: γ^Ψ(σ_Λ | ω_{Λ^c}),

with

H^Ψ_Λ(σ_Λ ω_{Λ^c}) = ∑_{V ∩ Λ ≠ ∅} Ψ(V, σ_Λ ω_{Λ^c})?

Depending on the answer, the measure P will or will not be Gibbs. It turns out that the answer, given by Kozlov [5] and Sullivan [6], is always positive.

Theorem 2.6. If Γ = {γ(·|·)} is a uniformly non-null quasi-local specification, then there exists a uniformly absolutely convergent interaction Ψ such that

Ψ γ(σΛ|ωΛc ) = γ (σΛ|ωΛc ).

Equivalently, if P is consistent with some uniformly non-null quasi-local speci- fication, then there exists a uniformly absolutely convergent interaction Ψ, such that P is a Gibbs state for Ψ.

However, in case the uniformly non-null quasi-local specification is translation invariant, the corresponding interaction is not necessarily translation invariant. There is a procedure to derive a translation invariant interaction consistent with a given translation invariant specification, but the resulting interaction is not necessarily uniformly absolutely convergent; it does, however, satisfy a weaker summability condition. In fact, the following is an open problem in the theory of Gibbs states.

Open problem: Does every translation invariant uniformly non-null quasi-local specification allow a Gibbs representation for some translation invariant uniformly absolutely convergent interaction?

The theory of Gibbs states is extremely rich [7, 18]. In particular, there is a dual possibility to characterize Gibbs states locally (via specifications) or globally (via variational principles). A variety of probabilistic techniques is available for Gibbs distributions: limit theorems, coupling techniques, cluster expansions, large deviations.

2.3 Gibbs and other regular stochastic processes

Gibbs probability distributions have been highly useful in describing physical systems (ferromagnets, dilute gases, polymers). However, the intrinsically multi-dimensional theory of Gibbs random fields is of course applicable to one-dimensional random processes as well. To define a Gibbs random process {Xn} we have to prescribe the two-sided conditional distributions

P(X_0 = a_0 | X_{−∞}^{−1} = a_{−∞}^{−1}, X_1^{+∞} = a_1^{+∞}),  a_i ∈ A,   (2.3.1)

i.e., the conditional distribution of the present given the complete past and the complete future. Defining processes via the one-sided conditional probabilities

P(X_0 = a_0 | X_{−∞}^{−1} = a_{−∞}^{−1}),

using only past values, is probably more popular. In fact, the full Gibbsian description has rarely been used outside the realm of Statistical Physics. Nevertheless, two-sided descriptions have definite advantages in some applications, as we will demonstrate in subsequent chapters.

On the other hand, the idea of using (in some sense) regular conditional probabilities to define a random process is very natural and has been employed on multiple occasions in various fields. As we have seen in the preceding sections, Gibbs measures are characterized by regular (quasi-local and non-null) specifications: two-sided conditional probabilities. At the same time, many probabilistic models have been introduced in the past 100 years which require some form of regularity of one-sided conditional probabilities. In many cases, it turns out that one-sided regularity also implies two-sided regularity, and hence these processes are also Gibbs.¹ Let us now review some of the most popular regular one-sided models.

2.3.1 Markov Chains

Markov chains were introduced by Markov in the early 1900's.

Definition 2.7. The process {Xn}n≥0 with values in A, |A| < ∞, is Markov if it has the so-called Markov property:

P(X_n = a_n | X_{n−1} = a_{n−1}, X_{n−2} = a_{n−2}, ...) = P(X_n = a_n | X_{n−1} = a_{n−1})   (2.3.2)

for all n ≥ 1 and all a_n, a_{n−1}, ... ∈ A, i.e., only the immediate past has influence on the present.

Clearly, (2.3.2) is a very strong form of regularity of conditional probabili- ties.

¹From a practical point of view, studying functions of "one-sided regular" processes is also important. In fact, the basic question is: what happens to functions of regular processes?

We will mainly be interested in stationary Markov processes, or equiva- lently, in time-homogeneous chains:

P(X_n = a_n | X_{n−1} = a_{n−1}) = p_{a_{n−1}, a_n}

depends only on the symbols a_{n−1}, a_n, and not on n. Time-homogeneous Markov chains are parametrized by transition probability matrices: non-negative stochastic matrices

P = (p_{a,a′})_{a,a′ ∈ A},  p_{a,a′} ≥ 0,  ∑_{a′} p_{a,a′} = 1.

And hence

P(X_0 = a_0, X_1 = a_1, ..., X_n = a_n) = P(X_0 = a_0) · ∏_{i=0}^{n−1} P(X_{i+1} = a_{i+1} | X_i = a_i)

(telescoping).

The law P of the Markov process (chain) {Xn} is shift-invariant if the initial distribution

π = ( P(X_0 = a_1), ..., P(X_0 = a_m) ),  A = {a_1, ..., a_m},

satisfies πP = π.
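The stationary π can be computed by iterating the action of P on an initial distribution, which converges for strictly positive stochastic matrices (the setting of Exercise 2.9 below). A minimal sketch with an invented 2x2 matrix:

```python
# Stationary initial distribution for a strictly positive stochastic matrix,
# found by power iteration (the matrix values are illustrative).
P = [[0.7, 0.3],
     [0.4, 0.6]]

def stationary(P, iters=200):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One step of pi -> pi P.
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = stationary(P)
# Residual of the fixed-point equation pi P = pi.
residual = max(abs(sum(pi[i] * P[i][j] for i in range(2)) - pi[j]) for j in range(2))
```

For this matrix the balance equation π_0 · 0.3 = π_1 · 0.4 gives π = (4/7, 3/7), which the iteration reproduces; convergence is geometric at the rate of the second eigenvalue.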

Remark 2.8. The law P of a stationary (shift-invariant) Markov chain gives us a measure on Ω = A^{Z_+} = {ω = (ω_0, ω_1, ...) : ω_i ∈ A}. Any translation invariant measure on Ω = A^{Z_+} can be extended in a unique way to a measure on Ω = A^Z (by translation invariance we can determine the measure of any cylindric event).

Exercise 2.9. Suppose P = (p_{a,a′})_{a,a′ ∈ A} is a strictly positive stochastic matrix. Then there exists a unique initial distribution π which gives rise to a stationary Markov chain. Moreover, the corresponding stationary law P, when extended to Ω = A^Z, is Gibbs. Identify a Gibbs interaction Ψ of P.

2.3.2 k-step Markov chains

Sometimes Markov chains can be inadequate as models of processes of interest, e.g., when the process has a longer "memory". For these situations, Markov chains with longer memory, the k-step Markov chains, have been introduced. They are characterized by the k-step Markov property:

P(X_n = a_n | X_{n−1} = a_{n−1}, X_{n−2} = a_{n−2}, ...) = P(X_n = a_n | X_{n−1} = a_{n−1}, ..., X_{n−k} = a_{n−k})

for all n ≥ k and all a_n, a_{n−1}, ... ∈ A. One can show that any stationary process can be well approximated by k-step Markov chains as k → ∞ (Markov measures are dense among all shift-invariant measures). The main drawback of using k-step Markov chains in applications is the necessity to estimate a large number (= |A|^k · (|A| − 1)) of parameters (the elements of P = (p_{~a,a′})_{~a ∈ A^k, a′ ∈ A}).

2.3.3 Variable Length Markov Chains

Variable Length Markov Chains (VLMC's), also known as probabilistic tree sources in Information Theory, are characterized by the property that the conditional probabilities

P(X_n = a_n | X_{n−1} = a_{n−1}, X_{n−2} = a_{n−2}, ...)

depend on finitely many preceding symbols, but the number of such symbols is varying, depending on the context. More specifically, there exists a function

ℓ : A^{−N} → Z_+ (or {0, 1, ..., N})

such that

P(X_n = a_n | X_{n−1} = a_{n−1}, X_{n−2} = a_{n−2}, ...) = P(X_n = a_n | X_{n−ℓ(a_{n−1}, a_{n−2}, ...)}^{n−1} = a_{n−ℓ(a_{n−1}, a_{n−2}, ...)}^{n−1});   (2.3.3)

if ℓ(~a) = 0, then P(X_n = a_n | X_{n−1} = a_{n−1}, ...) = P(X_n = a_n), i.e., the next symbol is drawn independently of the past.

Conditional probabilities (2.3.3) can be visualized with the help of a probabilistic tree: each terminal node carries (is assigned) a probability distribution on {0, 1}, and for a given context (a_{−1}, a_{−2}, ...) the actual distribution is found by performing the corresponding walk on the tree, following the prescribed edges until a terminal node is reached.
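The tree walk above can be sketched directly. The context tree below (leaves "1", "01", "00", each carrying a probability of emitting 1) is a made-up toy example, not one from the text; note that the three leaves cover every possible past, as a valid context tree must.

```python
# A toy context tree for a binary VLMC: leaf -> P(next symbol = 1 | context).
TREE = {"1": 0.3, "01": 0.6, "00": 0.9}

def context_prob_of_one(past):
    # `past` lists symbols most-recent-first: past[0] = X_{n-1}, past[1] = X_{n-2}, ...
    # Walk down the tree until a terminal node (leaf) is reached.
    ctx = ""
    for s in past:
        ctx += str(s)
        if ctx in TREE:
            return TREE[ctx], len(ctx)   # the distribution and the memory length l
    raise ValueError("past too short for this context tree")

p, ell = context_prob_of_one([0, 1, 1, 0])   # context "01": two symbols suffice
```

Here the memory length ℓ is 1 for any past starting with 1, and 2 for pasts starting with 0, illustrating the variable-length memory.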

2.3.4 Countable Mixtures of Markov Chains (CMMC’s)

We have

P(a_0 | a_{<0}) = λ_0 P^{(0)}(a_0) + ∑_{k=1}^{∞} λ_k P^{(k)}(a_0 | a_{−k}^{−1}),   (2.3.4)

where P^{(k)} is an order-k transition probability and ∑_{k≥0} λ_k = 1. In such chains (2.3.4), the next symbol can be generated using a combination of two independent random steps. Firstly, with probability λ_k the k-th Markov chain is selected, and then the next symbol is chosen according to P^{(k)}. Thus, each transition depends on a finite, but random, number of preceding symbols. Note that the conditional probabilities (2.3.4) are continuous in the following sense.

| P(X_0 = a_0 | X_{−k}^{−1} = a_{−k}^{−1}, X_{−∞}^{−k−1} = b_{−∞}^{−k−1}) − P(X_0 = a_0 | X_{−k}^{−1} = a_{−k}^{−1}, X_{−∞}^{−k−1} = c_{−∞}^{−k−1}) | ≤ ∑_{m>k} λ_m → 0 as k → ∞.

2.3.5 g-measures (chains with complete connections)

The class of g-measures was introduced by M. Keane in 1972 [17]. These measures are characterized by regular one-sided conditional probabilities. In fact, similar definitions had already been given by various authors under a variety of names: e.g., chains with complete connections (chaînes à liaisons complètes) by Onicescu and Mihoc (1935), for which Doeblin and Fortet [14] established the first uniqueness results, and chains of infinite order by Harris (1955). In our presentation we use notation inspired by dynamical systems; it might seem strange from a probabilistic point of view, but it is completely equivalent to the more familiar probabilistic way of defining processes given the past, discussed in the previous sections.

Let A be a finite set, and put

Ω = A^{Z_+} = {ω = (ω_0, ω_1, ...) : ω_n ∈ A for all n ∈ Z_+}.

Denote by G = G_Ω the class of continuous (in the product topology), positive functions g : Ω → [0, 1] which are normalized: for all ω ∈ A^{Z_+},

∑_{a ∈ A} g(aω) = 1,

where aω ∈ Ω is the sequence (a, ω_0, ω_1, ...).

Definition 2.11. A probability measure P on Ω = A^{Z_+} is called a g-measure if

P(ω_0 | ω_1, ω_2, ...) = g(ω_0 ω_1 ···)  (P-a.s.),   (2.3.5)

equivalently, if for any f ∈ C(Ω, R) one has

∫_Ω f(ω) P(dω) = ∫_Ω [ ∑_{a_0} f(a_0 ω) g(a_0 ω) ] P(dω).   (2.3.6)

We will refer to (2.3.6) as the one-sided Dobrushin-Lanford-Ruelle equations. The function g can be viewed as a one-sided quasi-local non-null specification. One can show [9] that the expression in square brackets in (2.3.6) has the following probabilistic interpretation: namely, as the conditional expectation of f with respect to a certain σ-algebra. If A is the σ-algebra of Borel subsets of Ω, and F_1 = S^{−1}A = σ({ω_k}_{k≥1}), then

E_P(f | F_1)(ω) = ∑_{a_0 ∈ A} f(a_0 ω_1 ω_2 ...) g(a_0 ω_1 ω_2 ...)

for all continuous f, and, more generally, for all f ∈ L^1(Ω, P). All the previous examples we considered (Markov chains, k-step Markov chains, VLMC's with bounded memory, CMMC's) fall into this class of g-measures. It is easy to see that the VLMC with infinite memory in Example 2.10:

P(1 | 0^k 1 ∗) = p_k,  k ≥ 0,   (2.3.7)
P(1 | 0^∞) = p_∞,   (2.3.8)

is also a g-measure provided

p_∞ = lim_{k→∞} p_k.
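For a stationary Markov chain, the g-function of Definition 2.11 can be written explicitly: since here g(aω) conditions ω_0 = a on the subsequent coordinates, it is the time-reversed transition probability g(a | ω_0 = b) = π_a p_{a,b} / π_b. A quick numerical check of the normalization and of (2.3.6) for a cylinder indicator, using an illustrative 2x2 matrix:

```python
# Stationary two-state Markov chain (matrix values are illustrative).
P  = [[0.7, 0.3],
      [0.4, 0.6]]
pi = [4 / 7, 3 / 7]                  # stationary: pi P = pi

# Reversed-time g-function: g(a omega) depends only on a and b = omega_0.
def g(a, b):
    return pi[a] * P[a][b] / pi[b]

# Normalization sum_a g(a omega) = 1 for each "future" omega_0 = b.
norms = [sum(g(a, b) for a in (0, 1)) for b in (0, 1)]

# Both sides of (2.3.6) with f = indicator of the cylinder [b0 b1]:
b0, b1 = 0, 1
lhs = pi[b0] * P[b0][b1]             # integral of f dP = P([b0 b1])
rhs = g(b0, b1) * pi[b1]             # integral of g(b0 omega) over {omega_0 = b1}
```

The two sides agree exactly, confirming that the stationary Markov law is a g-measure for this g.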

Theorem 2.12. For any g ∈ GΩ, there exists at least one translation invariant g-measure.

Proof. Consider the transfer operator L_g acting on continuous functions C(Ω), defined as

(L_g f)(ω) = ∑_{a ∈ A} g(aω) f(aω).

The operator L_g maps C(Ω) to C(Ω), and hence has a well-defined dual operator L*_g acting on M(Ω), the set of probability measures on Ω, given by

∫_Ω (L_g f) dP = ∫_Ω f d(L*_g P).   (2.3.9)

Moreover, L*_g maps the set of translation invariant measures M_S(Ω) into itself and is continuous. The Schauder-Tychonoff fixed point theorem states that any continuous map of a compact convex set into itself has a fixed point. Since the set of translation invariant probability measures M_S(Ω) is compact and convex, the operator L*_g has a fixed point, say P = L*_g P, and hence (2.3.6) follows from (2.3.9); therefore, P is a g-measure.

Exercise 2.13. Validate that L*_g indeed preserves M_S(Ω).

Theorem 2.14 (Walters). If g ∈ G_Ω, then any g-measure P has full support. Moreover, if g_1, g_2 ∈ G_Ω are such that there is a measure P which is a g_1-measure as well as a g_2-measure, then g_1 = g_2.

Proof. It is sufficient to prove that any g-measure P assigns positive probability to any cylinder set [c_0^{n−1}]. Since g is continuous and positive,

κ = inf_{ω ∈ Ω} g(ω) > 0.

Since P is a g-measure, for any continuous function f : Ω → R one has

∫_Ω f(ω) P(dω) = ∫_Ω ∑_{a_0 ∈ A} f(a_0 ω) g(a_0 ω) P(dω),

and, iterating the previous identity, one obtains

∫_Ω f(ω) P(dω) = ∫_Ω [ ∑_{~a = (a_0^{n−1}) ∈ A^n} f(a_0^{n−1} ω) ∏_{i=0}^{n−1} g(a_i^{n−1} ω) ] P(dω).

Consider

f(ω) = I_{[c_0^{n−1}]}(ω) = 1 if ω_i = c_i for i = 0, ..., n−1, and 0 otherwise.

Therefore,

Z Z n−1 n−1 n−1 P([c ]) = I n−1 (ω)P( dω) = g(c ω)P( dω) 0 [c0 ] ∏ i Ω Ω i=0 Z ≥ κn P( dω) = κn > 0. Ω

The second claim follows from the claim that if g1 and g2 have a common g-measure, then g1 = g2 (P − a.s.), and hence everywhere on Ω.

Theorem 2.15. A fully supported measure $\mathbb P$ on $\Omega = A^{\mathbb Z_+}$ is a g-measure for some $g\in G(\Omega)$ if and only if the sequence
\[
\mathbb P(\omega_0\mid\omega_1\ldots\omega_n) = \frac{\mathbb P([\omega_0\omega_1\ldots\omega_n])}{\mathbb P([\omega_1\ldots\omega_n])}
\]
converges uniformly in $\omega$ as $n\to\infty$.

Proof. Suppose $\mathbb P$ is a g-measure. Using the expression for the probabilities of cylinders obtained in the proof of the previous theorem, one has
\[
\frac{\mathbb P(c_0^{n-1})}{\mathbb P(c_1^{n-1})} = \frac{\int_\Omega \bigl(\prod_{i=0}^{n-1} g(c_i^{n-1}\omega)\bigr)\mathbb P(d\omega)}{\int_\Omega \bigl(\prod_{i=1}^{n-1} g(c_i^{n-1}\omega)\bigr)\mathbb P(d\omega)}.
\]
Thus
\[
\mathbb P(c_0\mid c_1^{n-1}) \in \Bigl[\inf_\omega g(c_0^{n-1}\omega),\ \sup_\omega g(c_0^{n-1}\omega)\Bigr].
\]
Since $g$ is continuous on $\Omega = A^{\mathbb Z_+}$, and hence uniformly continuous, the $n$-th variation of $g$,
\[
\mathrm{var}_n(g) = \sup_{c_0^{n-1}}\ \sup_{\omega,\tilde\omega\in\Omega}\ \bigl|g(c_0^{n-1}\omega) - g(c_0^{n-1}\tilde\omega)\bigr| \to 0
\]
as $n\to\infty$, and thus
\[
\mathbb P(c_0\mid c_1^{n-1})/g(c) \in [1-\delta_n,\ 1+\delta_n], \quad\text{where } \delta_n\to 0.
\]
In the opposite direction, the functions $g_n(c) = \mathbb P(c_0\mid c_1^n)$ are local, since they depend only on finitely many coordinates, and thus the $g_n$'s are also continuous. This sequence of continuous functions is assumed to converge uniformly on the compact metric space $\Omega$. Therefore, the limit

function g is also continuous. By the Martingale Convergence Theorem the limit must be equal to

\[
\mathbb P(c_0\mid c_1, c_2, \ldots) = g(c_0, c_1, \ldots) \quad \mathbb P\text{-a.e.},
\]

and hence P is a g-measure.

As we have seen, since g is continuous,

\[
\mathrm{var}_n(g) = \sup_{c_0^{n-1},\,\omega,\tilde\omega}\ \bigl|g(c_0^{n-1}\omega) - g(c_0^{n-1}\tilde\omega)\bigr| \to 0
\]

as n → ∞. Many properties of g-measures depend on how quickly varn(g) → 0. For example, if g has exponentially decaying variation:

\[
\mathrm{var}_n(g) \le c\,\theta^n
\]

for some c > 0, θ ∈ (0, 1), or more generally, g has summable variation:

\[
\sum_{n=1}^\infty \mathrm{var}_n(g) < \infty,
\]

then there exists a unique g-measure with good properties.

Theorem 2.16. Suppose $g\in G_\Omega$ has summable variation. Then $g$ admits a unique g-measure $\mathbb P$. Moreover, $\mathbb P$ (viewed as a measure on $A^{\mathbb Z}$) is Gibbs for some uniformly absolutely summable interaction $\psi$.

Proof. For any f ∈ C(Ω) and every n ≥ 1, one has

\[
(L_g^n f)(\omega) = \sum_{a_0^{n-1}\in A^n}\ \prod_{i=0}^{n-1} g(a_i^{n-1}\omega)\, f(a_0^{n-1}\omega).
\]

Consider arbitrary $\omega,\tilde\omega\in\Omega$ such that $\omega_0^{k-1} = \tilde\omega_0^{k-1}$, $k\in\mathbb N$. Then

\[
\bigl|L_g^n f(\omega) - L_g^n f(\tilde\omega)\bigr| \le \sum_{a_0^{n-1}\in A^n}\Bigl|\prod_{i=0}^{n-1} g(a_i^{n-1}\omega) - \prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)\Bigr|\cdot\bigl|f(a_0^{n-1}\omega)\bigr| + \sum_{a_0^{n-1}\in A^n}\prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)\,\bigl|f(a_0^{n-1}\omega) - f(a_0^{n-1}\tilde\omega)\bigr|
\]
\[
\le \|f\|_{C(\Omega)}\sum_{a_0^{n-1}\in A^n}\Bigl|\prod_{i=0}^{n-1} g(a_i^{n-1}\omega) - \prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)\Bigr| + \mathrm{var}_{n+k}(f),
\]
since
\[
\sum_{a_0^{n-1}\in A^n}\ \prod_{i=0}^{n-1} g(a_i^{n-1}\omega) = 1
\]
for all $n\ge 1$ and every $\omega$, because $g$ is normalized. Let us now derive a uniform estimate of
\[
C_n(a_0^{n-1},\omega,\tilde\omega) := \Bigl|\prod_{i=0}^{n-1} g(a_i^{n-1}\omega) - \prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)\Bigr| = \prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)\cdot\Bigl|\frac{\prod_{i=0}^{n-1} g(a_i^{n-1}\omega)}{\prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)} - 1\Bigr|.
\]
One can continue as follows:
\[
\frac{\prod_{i=0}^{n-1} g(a_i^{n-1}\omega)}{\prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega)} = \exp\Bigl(\sum_{i=0}^{n-1}\bigl[\log g(a_i^{n-1}\omega) - \log g(a_i^{n-1}\tilde\omega)\bigr]\Bigr) \in [e^{-D_k},\, e^{D_k}],
\]
where, taking into account that $\omega_0^{k-1} = \tilde\omega_0^{k-1}$,
\[
D_k = \sum_{j=k+1}^\infty \mathrm{var}_j(\log g) \to 0 \quad\text{as } k\to\infty,
\]
because if $g$ has summable variation, then so does $\log g$ (see Exercise 2.17 below). Therefore, for $D_k' = \max\bigl(e^{D_k} - 1,\ 1 - e^{-D_k}\bigr)\to 0$, one has
\[
C_n(a_0^{n-1},\omega,\tilde\omega) \le D_k'\prod_{i=0}^{n-1} g(a_i^{n-1}\tilde\omega),
\]
and hence
\[
\bigl|L_g^n f(\omega) - L_g^n f(\tilde\omega)\bigr| \le D_k'\,\|f\|_{C(\Omega)} + \mathrm{var}_k(f).
\]
The sequence of functions $f_n = L_g^n f$ is therefore equicontinuous: for every $\epsilon > 0$ there exists $k\in\mathbb N$ such that
\[
\bigl|L_g^n f(\omega) - L_g^n f(\tilde\omega)\bigr| < \epsilon
\]
for all $n$ and every $\omega,\tilde\omega\in\Omega$ with $\omega_0^{k-1} = \tilde\omega_0^{k-1}$, i.e., $\rho(\omega,\tilde\omega)\le 2^{-k}$. Moreover, the sequence $\{f_n\}$ is uniformly bounded:
\[
\|L_g^n f\|_{C(\Omega)} \le \|f\|_{C(\Omega)}.
\]

Hence, by the Arzelà–Ascoli theorem, the sequence $\{f_n\}$ has a uniformly convergent subsequence $\{f_{n_j}\}$:
\[
L_g^{n_j} f \rightrightarrows f_*\in C(\Omega).
\]

Now we show that $f_*$ is in fact constant, and that $L_g^n f\to f_*$ as $n\to\infty$. For any continuous function $h$, let
\[
\alpha(h) = \min_{\omega\in\Omega} h(\omega).
\]
It is easy to see that for any $f\in C(\Omega)$ one has
\[
\alpha(f) \le \alpha(L_g f) \le \ldots \le \alpha(L_g^n f) \le \ldots
\]
Moreover, taking into account that $L_g^{n_j} f\rightrightarrows f_*$ and the monotonicity of $\alpha(L_g^n f)$, we conclude that
\[
\alpha(f_*) = \lim_n \alpha(L_g^n f) = \alpha(L_g f_*).
\]

Furthermore, suppose ω ∈ Ω is such that

\[
\alpha(f_*) = \alpha(L_g f_*) = (L_g f_*)(\omega),
\]

but then
\[
(L_g f_*)(\omega) = \sum_{c_0\in A} g(c_0\omega)\, f_*(c_0\omega) = \inf_{\tilde\omega\in\Omega} f_*(\tilde\omega).
\]
Since the weights $g(c_0\omega) > 0$ sum to one, this average can equal the minimum only if every term does,

and hence for each $c_0$,
\[
f_*(c_0\omega) = \inf_{\tilde\omega\in\Omega} f_*(\tilde\omega) = \alpha(f_*).
\]
Repeating the argument, we conclude that for any $c_0^{n-1}\in A^n$,
\[
f_*(c_0^{n-1}\omega) = \alpha(f_*),
\]

i.e., $f_*$ attains its minimal value on every cylinder. Since $f_*$ is continuous, $f_*$ must therefore be constant, and then $L_g^n f \rightrightarrows f_*$ as $n\to\infty$ (by the monotonicity of $\alpha$ along the whole sequence). In conclusion, for every $f\in C(\Omega)$, $L_g^n f \rightrightarrows f_* = \mathrm{const}$. In other words, we have a linear functional $V: C(\Omega)\to\mathbb R$ defined as

\[
V(f) = f_*,
\]
which is non-negative ($f\ge 0$ implies $V(f)\ge 0$) and normalized ($V1 = 1$). By the Riesz representation theorem, there exists a probability measure $\mathbb P$ on $\Omega$ such that
\[
V(f) = \int_\Omega f\, d\mathbb P.
\]
Since obviously $V(f) = V(L_g f)$, we have $\mathbb P = L_g^*\mathbb P$, and hence $\mathbb P$ is the g-measure. If $\tilde{\mathbb P}$ is another g-measure, then integrating
\[
L_g^n f \rightrightarrows \int_\Omega f\, d\mathbb P
\]
with respect to $\tilde{\mathbb P}$, we conclude that
\[
\int_\Omega f\, d\tilde{\mathbb P} = \int_\Omega f\, d\mathbb P
\]
for every continuous $f$, and hence $\tilde{\mathbb P} = \mathbb P$. □
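The key numerical fact in the proof – that $L_g^n f$ converges uniformly to the constant $\int f\, d\mathbb P$ – can be watched directly in the finite-memory case. In the sketch below (illustrative only, with a random two-coordinate $g$ on a three-letter alphabet), $L_g$ acts on functions of the first coordinate as the matrix $G^\top$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = 3
G = rng.random((A, A))
G /= G.sum(axis=0, keepdims=True)       # normalized two-coordinate g

# for f depending on the first coordinate, (L_g f)(w) = sum_a G[a, w0] f(a),
# i.e. a single application of the matrix G^T
f = rng.random(A)
fn = f.copy()
for _ in range(2000):
    fn = G.T @ fn

assert np.allclose(fn, fn[0])           # the iterates flatten to a constant

# ... and the constant is the integral of f against the unique g-measure,
# here obtained as the fixed point of the dual operator mu -> G mu
mu = np.full(A, 1.0 / A)
for _ in range(5000):
    mu = G @ mu
assert np.isclose(fn[0], f @ mu)
```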

 Exercise 2.17. Verify that if $g\in G_\Omega$ has summable variation, then $\log g$ has summable variation as well.

Remark 2.18. (i) Various uniqueness conditions have been obtained in the past; see [19] for a review. Major progress was achieved by Johansson and Öberg (2002), who proved uniqueness of g-measures for the class of g-functions with square summable variation:
\[
\sum_{n\ge 1}\mathrm{var}_n(g)^2 < \infty;
\]
see also [16] for a recent result strengthening and subsuming the previous ones. (ii) Bramson and Kalikow [13] constructed the first example of a continuous positive normalized g-function with multiple g-measures. Berger et al. [10] showed that $(2+\epsilon)$-summability of variations is not sufficient for uniqueness of g-measures. (iii) One can try to generalize the theory of g-measures by considering functions $g$ which are not necessarily continuous, but the available theory is still quite limited [9, 15].

2.4 Gibbs measures in Dynamical Systems

Among all Gibbs measures on a lattice system $\Omega = A^{\mathbb Z^d}$, a special subclass is formed by the translation invariant measures. The lattice group $\mathbb Z^d$ acts on $\Omega$ by shifts: $(S^n\omega)_k = \omega_{k+n}$ for all $\omega\in\Omega$ and $k,n\in\mathbb Z^d$. A measure $\mathbb P$ is invariant under this action if $\mathbb P(S^{-n}C) = \mathbb P(C)$ for any event $C$ and all $n$. Shortly after the founding work of Dobrushin, Lanford, and Ruelle, Sinai [4] extended the notion of invariant Gibbs measures to more general phase spaces such as manifolds, i.e., spaces lacking the product structure of lattice systems, thus making the notion of Gibbs states applicable to invariant measures of dynamical systems. In subsequent works it was shown that the important class of natural invariant measures (now known as Sinai–Ruelle–Bowen measures) of a large class of chaotic dynamical systems are Gibbs. That sparked great interest in these objects in

Dynamical Systems and Ergodic Theory. In the Dynamical Systems literature one can find several (not always equivalent) definitions of Gibbs states. For example, the following definition due to Bowen [11] is very popular. Suppose $f$ is a homeomorphism of a compact metric space $(X,d)$, and $\mathbb P$ is an $f$-invariant Borel probability measure on $X$. Then $\mathbb P$ is called Gibbs in the Bowen sense (abbreviated Bowen–Gibbs) for a continuous potential $\psi: X\to\mathbb R$ if for some $P\in\mathbb R$ and every sufficiently small $\epsilon > 0$ there exists $C = C(\epsilon) > 1$ such that the inequalities
\[
\frac 1C \le \frac{\mathbb P\bigl(y\in X :\ d(f^j(x), f^j(y)) < \epsilon\ \ \forall j = 0,\ldots,n-1\bigr)}{\exp\bigl(-nP + \sum_{j=0}^{n-1}\psi(f^j(x))\bigr)} \le C
\]
hold for all $x\in X$ and every $n\in\mathbb N$. In other words, the set of points staying $\epsilon$-close to the orbit of $x$ for $n$ units of time has measure roughly $e^{-H_n(x)}$ with $H_n(x) = nP - \sum_{j=0}^{n-1}\psi(f^j(x))$. Capocaccia [1] gave the most general definition of Gibbs invariant measures of multidimensional actions (dynamical systems) on arbitrary spaces, hence extending the Dobrushin–Lanford–Ruelle (DLR) construction substantially.

Remark 2.19. For symbolic systems $\Omega = A^{\mathbb Z}$ or $A^{\mathbb Z_+}$ with the dynamics given by the left shift $S:\Omega\to\Omega$, the above definition can be simplified: a translation invariant measure $\mathbb P$ is called Bowen–Gibbs for a potential $\psi\in C(\Omega,\mathbb R)$ if there exist constants $C > 1$ and $P$ such that for any $\omega\in\Omega$ and all $n\in\mathbb N$ one has
\[
\frac 1C \le \frac{\mathbb P\bigl(\tilde\omega\in\Omega :\ \tilde\omega_j = \omega_j\ \ \forall j = 0,\ldots,n-1\bigr)}{\exp\bigl(-nP + \sum_{j=0}^{n-1}\psi(S^j\omega)\bigr)} \le C. \qquad(2.4.1)
\]

2.4.1 Equilibrium states

Suppose $\psi:\Omega\to\mathbb R$ is a continuous function; again, for simplicity, assume that $\Omega = A^{\mathbb Z^d}$ or $A^{\mathbb Z_+^d}$. Define the topological pressure of $\psi$ as
\[
P(\psi) = \sup_{\mathbb P\in\mathcal M_1(\Omega,S)}\Bigl[h(\mathbb P) + \int\psi\, d\mathbb P\Bigr],
\]
where the supremum is taken over the set of translation invariant probability measures on $\Omega$.

A measure $\mathbb P\in\mathcal M_1(\Omega,S)$ is called an equilibrium state for a potential $\psi\in C(\Omega)$ if
\[
P(\psi) = h(\mathbb P) + \int\psi\, d\mathbb P.
\]

For a given potential $\psi$, denote by $ES_\psi(\Omega,S)$ the set of equilibrium states. For symbolic systems, $ES_\psi(\Omega,S)\ne\emptyset$ for any continuous $\psi$.

Theorem 2.20 (Variational Principles). The following statements hold.

• Suppose $\Omega = A^{\mathbb Z^d}$ and $\Psi$ is a translation invariant uniformly absolutely convergent interaction. Then a translation invariant measure $\mathbb P$ is Gibbs for $\Psi$ if and only if $\mathbb P$ is an equilibrium state for
\[
\psi(\omega) = -\sum_{V\ni 0,\ V\Subset\mathbb Z^d}\frac 1{|V|}\Psi(V,\omega).
\]

• Suppose $\Omega = A^{\mathbb Z_+}$ and $g$ is a positive continuous normalized function on $\Omega$, i.e.,
\[
\sum_{a_0\in A} g(a_0\omega) = 1
\]
for all $\omega\in A^{\mathbb Z_+}$. Then a translation invariant measure $\mathbb P$ is a g-measure if and only if $\mathbb P$ is an equilibrium state for $\psi = \log g$.

• Suppose $\Omega = A^{\mathbb Z_+}$ or $\Omega = A^{\mathbb Z}$. Then a translation invariant measure $\mathbb P$ which is Bowen–Gibbs for $\psi$ is also an equilibrium state for $\psi$.

In the last case the converse is not true: for an arbitrary continuous function $\psi$, the corresponding equilibrium states need not be Bowen–Gibbs. However, under various additional "smoothness" assumptions on $\psi$, equilibrium states are Bowen–Gibbs.

Theorem 2.21 (effectively (Sinai, 1972); see also Ruelle's book). Suppose $\mathbb P$ is a translation invariant g-measure for a function $g\in G(A^{\mathbb Z_+})$ with $\mathrm{var}_n(g)\le C\theta^n$ for some $C > 0$, $\theta\in(0,1)$, and all $n\in\mathbb N$. Then $\mathbb P$ is Gibbs for some translation invariant uniformly absolutely convergent interaction
\[
\phi = \{\phi_\Lambda(\cdot)\}_{\Lambda\Subset\mathbb Z}.
\]

Idea: Consider $\varphi = -\log g\in C(A^{\mathbb Z})$; note that $\varphi$ depends only on $(\omega_0,\omega_1,\cdots)\in A^{\mathbb Z_+}$. Let us try to find a UAC interaction $\phi_{\Lambda_n}(\cdot)$, $\Lambda_n = [0,n]\cap\mathbb Z_+$ (with $\phi_\Lambda\equiv 0$ if $\Lambda$ is not an interval), such that
\[
\varphi(\omega) = -\sum_{n=0}^\infty \phi([0,n],\omega_0^n).
\]
If we can, then $\mathbb P$ will be Gibbs for $\phi$. Why? By the variational principle, any Gibbs measure for $\phi$ is also an equilibrium state for $\psi = -\sum_{V\ni 0}\frac 1{|V|}\phi(V,\cdot)$, and vice versa.
Claim: $\psi = -\sum_{V\ni 0}\frac 1{|V|}\phi(V,\cdot)$ and $\varphi = -\sum_{n=0}^\infty\phi([0,n],\cdot) = -\sum_{V:\ \min V = 0}\phi(V,\cdot)$ have equal sets of equilibrium states, the reason being that
\[
\int\psi\, d\mathbb P = \int\varphi\, d\mathbb P
\]
for every translation invariant measure $\mathbb P$.

\[
\int\psi\, d\mathbb P = -\sum_{n=0}^\infty\frac 1{n+1}\sum_{i=-n}^{0}\int\phi([0,n],\omega_i^{n+i})\, d\mathbb P = -\sum_{n=0}^\infty\frac 1{n+1}(n+1)\int\phi([0,n],\omega_0^n)\, d\mathbb P = \int\varphi\, d\mathbb P,
\]
where we used the translation invariance of $\mathbb P$.

An easy fact to check: $\mathrm{var}_n(g)\le C\theta^n$ implies that $\mathrm{var}_n(\varphi)\le\tilde C\theta^n$ as well. Let $\tilde\varphi = -\varphi$. Put

\[
\phi_{\{0\}}(\omega_0) = \tilde\varphi(\omega_0,\vec 0),
\]
\[
\phi_{\{0,1\}}(\omega_0^1) = \tilde\varphi(\omega_0^1,\vec 0) - \tilde\varphi(\omega_0,\vec 0),
\]
\[
\vdots
\]
\[
\phi_{\{0,1,\cdots,n\}}(\omega_0^n) = \tilde\varphi(\omega_0^n,\vec 0) - \tilde\varphi(\omega_0^{n-1},\vec 0).
\]
Here $\vec 0$ is any fixed sequence. Clearly
\[
\sum_{n=0}^N \phi_{\{0,\cdots,n\}}(\omega_0^n) = \tilde\varphi(\omega_0^N,\vec 0) \xrightarrow[N\to\infty]{} \tilde\varphi(\omega) \quad (\forall\,\omega\in A^{\mathbb Z_+}).
\]
Moreover,
\[
\|\phi_{\{0,1,\cdots,n\}}(\cdot)\|_\infty \le \mathrm{var}_{n-1}(\tilde\varphi) \le \tilde C\theta^{n-1},
\]
and hence
\[
\sum_{V\ni 0}\|\phi_V(\cdot)\|_\infty \le \sum_{n=0}^\infty (n+1)\,\|\phi_{\{0,1,\cdots,n\}}(\cdot)\|_\infty < \infty
\]
as well.

Chapter 3

Functions of Markov processes

Last modified on October 26, 2015

3.1 Markov Chains (recap)

We will consider Markov chains $\{X_n\}$, $n\in\mathbb Z_+$ (or $n\in\mathbb Z$), with values in a finite set $A$. Time-homogeneous Markov chains are defined/characterized by probability transition matrices

\[
P = \bigl(p_{ij}\bigr)_{i,j\in A}, \qquad p_{ij} = \mathbb P(X_{n+1} = j\mid X_n = i),
\]
where $P$ is a stochastic matrix:

• pij ≥ 0 for all i, j ∈ A,

• $\sum_{j\in A} p_{ij} = 1$ for all $i\in A$.

If an initial probability distribution $\pi = (\pi_i)_{i\in A}$, with $\pi_i\ge 0$ for all $i\in A$ and $\sum_{i\in A}\pi_i = 1$, satisfies $\pi P = \pi$, then the process $\{X_n\}_{n\ge 0}$ defined by

(a) choosing X0 ∈ A according to π,

(b) choosing Xn+1 ∈ A according to pi,· : if Xn = i, then Xn+1 = j with probability pij independently of all previous choices,


is a stationary process. Its translation invariant law P on AZ+ is given by

\[
\mathbb P(X_n = a_n,\ X_{n+1} = a_{n+1},\ \ldots,\ X_{n+k} = a_{n+k}) = \pi(a_n)\prod_{i=0}^{k-1} p_{a_{n+i},\,a_{n+i+1}}
\]

for all $n\ge 0$ and $(a_n,\ldots,a_{n+k})\in A^{k+1}$.

If $P$ is a strictly positive matrix ($p_{ij} > 0$ for all $i,j$), then there exists a unique invariant distribution $\pi = \pi P$. The corresponding measure $\mathbb P$ on $A^{\mathbb Z_+}$ then has full support: every cylindric event

$\{X_n = a_n,\ldots,X_{n+k} = a_{n+k}\}$ has positive probability; this follows from the fact that $\pi > 0$ as well.
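The cylinder formula above is easy to check by simulation. A minimal sketch (the two-state matrix $P$ below is an arbitrary strictly positive choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])              # strictly positive (illustrative)

# stationary distribution: left eigenvector pi = pi P
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
assert np.allclose(pi @ P, pi)

# simulate the stationary chain
N = 100_000
u = rng.random(N)
x = np.empty(N, dtype=int)
x[0] = 0 if rng.random() < pi[0] else 1
for n in range(1, N):
    x[n] = 0 if u[n] < P[x[n - 1], 0] else 1

# empirical frequency of the word (0, 0, 1) vs pi(0) p_00 p_01
hits = sum(tuple(x[i:i + 3]) == (0, 0, 1) for i in range(N - 2))
exact = pi[0] * P[0, 0] * P[0, 1]
assert abs(hits / (N - 2) - exact) < 0.01
```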

If $P = (p_{ij})$ is not strictly positive, i.e., $p_{ij} = 0$ for some $i,j\in A$ (equivalently, if the transition from $i$ to $j$ is forbidden), then, in order to avoid some technical difficulties, we will assume that $P$ is primitive:

\[
P^n > 0 \quad\text{for some } n\in\mathbb N.
\]

In this case, there exists a unique positive $\pi$ satisfying $\pi = \pi P$. The support of the corresponding translation invariant measure $\mathbb P$ is a subshift of finite type.

Definition 3.1. Suppose $C$ is an $|A|\times|A|$ matrix of 0's and 1's. Let

\[
\Omega_C^+ = \bigl\{\omega = (\omega_0,\omega_1,\ldots)\in A^{\mathbb Z_+} :\ C_{\omega_i\omega_{i+1}} = 1\ \forall i\ge 0\bigr\},
\]
\[
\Omega_C = \bigl\{\omega = (\ldots,\omega_{-1},\omega_0,\omega_1,\ldots)\in A^{\mathbb Z} :\ C_{\omega_i\omega_{i+1}} = 1\ \forall i\in\mathbb Z\bigr\}
\]

be the corresponding one- and two-sided subshifts of finite type (determined by $C$). Remark 3.2. Subshifts of finite type (SFT's) are sometimes called topological Markov chains. SFT's were invented at IBM to encode information on magnetic tapes (and later CD's, DVD's, …).

If $P$ is primitive, the support of $\mathbb P$ is the SFT $\Omega_C^+$ (or, if $\mathbb P$ is viewed as an invariant measure for $\{X_n\}_{n\in\mathbb Z}$, the two-sided SFT $\Omega_C$), where $C = (c_{ij})$ is the support of $P = (p_{ij})$, i.e.,
\[
c_{ij} = \begin{cases} 1, & p_{ij} > 0,\\ 0, & p_{ij} = 0.\end{cases}
\]

3.1.1 Regular measures on Subshifts of Finite Type

Notions of Gibbs and g-measures can be easily extended to SFT’s.

Definition 3.3. Suppose $C$ is a primitive 0/1 matrix. Denote by $G(\Omega_C^+)$ the set of positive continuous normalized functions $g:\Omega_C^+\to(0,1)$:
\[
\sum_{a\in A:\ a\omega\in\Omega_C^+} g(a\omega) = 1
\]

for all $\omega\in\Omega_C^+$. A probability measure $\mathbb P$ on $\Omega_C^+$ is called a g-measure for $g\in G(\Omega_C^+)$ if
\[
\mathbb P(\omega_0\mid\omega_1\omega_2\cdots) = g(\omega) \quad\text{for } \mathbb P\text{-a.a. } \omega = (\omega_0,\omega_1,\ldots)\in\Omega_C^+.
\]
Theorem 3.4. A fully supported measure $\mathbb P$ on $\Omega_C^+$ is a g-measure if and only if $\mathbb P(\omega_0\mid\omega_1\cdots\omega_n)$ converges uniformly on $\Omega_C^+$ as $n\to\infty$.

 Exercise 3.5. Suppose $\{X_n\}_{n\ge 0}$ is a Markov chain with a primitive transition probability matrix $P$. Let $\mathbb P$ be the corresponding (unique) translation invariant measure on $\Omega_{C(P)}^+$. Show that $\mathbb P$ is a g-measure, and identify the function $g$.

3.2 Function of Markov Processes

Definition 3.6. Suppose $\{X_n\}$ is a Markov chain with values in a finite set $A$. Consider also a surjective function $\varphi: A\to B$ with $|B| < |A|$. The process $\{Y_n\}$ given by

\[
Y_n = \varphi(X_n) \quad\text{for all } n
\]
is called a function of a Markov chain.

A variety of names has been used for the processes $\{Y_n\}$:

• a Lumped Markov chain,

• an aggregated Markov chain,

• an amalgamated Markov chain,

• a 1-block factor of a Markov chain.

The law of {Yn} will be denoted by Q. The measure Q is called a pushfor- ward of P and is denoted by

\[
\mathbb Q = \varphi_*\mathbb P = \mathbb P\circ\varphi^{-1},
\]

meaning  −  Q (C) = P ϕ 1 (C)

for any measurable event C ⊂ BZ+ .

 Exercise 3.7. If P is translation invariant, then so is Q.

Lumped/aggregated/amalgamated Markov chains can be found in many applications. In particular, when one has to simulate a Markov chain on a very large state space, one would like to accelerate the process by simulating a chain on a much smaller space. Suppose $B = \{1,\ldots,k\}$; the task is to find a partition of $A$ into $k$ sets, $A = L_1\sqcup\cdots\sqcup L_k$, with the corresponding function $\varphi$:
\[
Y_n = \varphi(X_n) = \begin{cases} 1, & \text{if } X_n\in L_1,\\ \ \vdots & \\ k, & \text{if } X_n\in L_k,\end{cases}
\]

such that the resulting lumped process {Yn} is Markov. In this case, clearly

Q(Yn = k) = P (Xn ∈ Lk) .

Alternatively, one might be interested in finding Markov {Ycn} such that

Q(Ycn = k) ' P (Xn ∈ Lk) .

A classical question is under what conditions the lumped chain $\{Y_n\}$ is Markov.¹ Two versions of this question:

• Yn = ϕ (Xn) is Markov for every initial distribution π of {Xn}.

• Yn = ϕ (Xn) is Markov for some initial distribution π of {Xn}.

Concepts of (strong) lumpability and weak lumpability are related to these situations.

¹Studied by Dynkin, Kemeny, Snell, Rosenblatt, …

Example 3.8. Consider the following Markov chains and corresponding functions:

The process {Yn} corresponding to the left example is Markov, while the corresponding process on the right is not Markov.

Theorem 3.9. $Y_n = \varphi(X_n)$ is Markov for every initial distribution $\pi$ if and only if, with $L_j = \varphi^{-1}(j)$ for $j\in B$, the probability

\[
\mathbb P(X_{n+1}\in L_j\mid X_n = a)
\]
is constant over $a\in L_i$, for every pair $i,j$.

The common value is precisely the transition probability

\[
p^Y_{i,j} = \mathbb Q(Y_{n+1} = j\mid Y_n = i).
\]

For the discussion on sufficient conditions for weak lumpability, see Ke- meny and Snell.
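The lumpability criterion of Theorem 3.9 is directly checkable. A small sketch (the matrices and the partition below are illustrative choices, not from the text):

```python
import numpy as np

# Strong-lumpability test: Y = phi(X) is Markov for every initial
# distribution iff P(X_{n+1} in L_j | X_n = a) is constant over a in L_i.
def lumpable(P, blocks):
    """blocks: list of index lists L_1, ..., L_k partitioning the states."""
    for Li in blocks:
        for Lj in blocks:
            mass = [P[a, Lj].sum() for a in Li]
            if not np.allclose(mass, mass[0]):
                return False
    return True

blocks = [[0], [1, 2]]                  # merge states 1 and 2

P_good = np.array([[0.5, 0.2, 0.3],
                   [0.3, 0.5, 0.2],
                   [0.3, 0.1, 0.6]])
assert lumpable(P_good, blocks)         # rows 1,2 give equal mass to each block

P_bad = np.array([[0.5, 0.2, 0.3],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.1, 0.6]])
assert not lumpable(P_bad, blocks)      # rows 1,2 disagree on block {0}
```

The common value in the lumpable case is precisely the transition probability $p^Y_{i,j}$ of the lumped chain.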

3.2.1 Functions of Markov Chains in Dynamical systems

Suppose $\Sigma_1\subset A^{\mathbb Z}$ and $\Sigma_2\subset B^{\mathbb Z}$ are irreducible SFT's. A surjective coding map $\varphi: A\to B$ extends to a continuous factor map $\varphi:\Sigma_1\to\Sigma_2$. The coding map $\varphi$ is called Markovian if some Markov measure on $\Sigma_1$ is mapped to a Markov measure on $\Sigma_2$. The problem of deciding whether the coding map $\varphi$ is Markovian or not is somewhat complementary to the problems studied in symbolic dynamics (see the recent review by Boyle and Petersen). Here are several interesting results of this theory:

• Non-Markovian $\varphi$'s exist: a Markov measure $\mathbb Q$ on $\Sigma_2$ has many preimages $\mathbb P$ with $\mathbb Q = \mathbb P\circ\varphi^{-1}$, but possibly none of them is Markov.

• One can decide, for given $\varphi$ and $\mathbb P$, whether $\mathbb Q = \mathbb P\circ\varphi^{-1}$ is $k$-step Markov for some $k$, or not Markov at all.

The theory also offers many useful tools and concepts, e.g., compensation functions, which we will discuss later.

3.3 Hidden Markov Models

Hidden Markov Models (Chains) are random functions of Markov chains. The ingredients are:

(i) A Markov chain {Xn} with values in A, defined by a transition prob- ability matrix P and initial distribution π.

(ii) A family of emission probability kernels

Π = {Πi(·)}i∈A,

where Πi is a probability measure on the output space B.

The output (observable) process $\{Y_n\}$ is constructed as follows: for every $n$, independently, let $Y_n = b$ with probability $\Pi_i(b)$ if $X_n = i$, i.e.,

P(Yn = b|Xn = i) = Πi(b).

Example 3.10. Binary Symmetric Markov Chain transmitted through the Binary Symmetric Channel:

(Diagram: the binary symmetric channel – each input bit $0$ or $1$ is transmitted correctly with probability $1-\epsilon$ and flipped with probability $\epsilon$.)

• The input $\{X_n\}$, $X_n\in\{0,1\}$, is a symmetric Markov chain with transition probability matrix
\[
P = \begin{pmatrix} 1-p & p\\ p & 1-p\end{pmatrix}, \quad\text{where } p\in(0,1).
\]

• The noise {Zn} is an iid sequence of 0’s and 1’s with

P(Zn = 1) = e, P(Zn = 0) = 1 − e

• The output

Yn = Xn ⊕ Zn,

equivalently,

\[
Y_n = \begin{cases} X_n, & \text{with probability } 1-\epsilon,\\ \bar X_n, & \text{with probability } \epsilon,\end{cases}
\]

where $\bar X_n$ is the bit flip of $X_n$.

We can represent $\{Y_n\}$ as a deterministic function of an extended Markov chain
\[
X_n^{ext} = (X_n, Z_n)\in\{0,1\}^2,
\]
with transition probability matrix $P^{ext}$ (states ordered $(0,0), (0,1), (1,0), (1,1)$):

\[
P^{ext} = \begin{pmatrix}
(1-p)(1-\epsilon) & (1-p)\epsilon & p(1-\epsilon) & p\epsilon\\
(1-p)(1-\epsilon) & (1-p)\epsilon & p(1-\epsilon) & p\epsilon\\
p(1-\epsilon) & p\epsilon & (1-p)(1-\epsilon) & (1-p)\epsilon\\
p(1-\epsilon) & p\epsilon & (1-p)(1-\epsilon) & (1-p)\epsilon
\end{pmatrix},
\]

since
\[
\mathbb P\bigl(X^{ext}_{n+1} = (a_{n+1}, b_{n+1})\mid X^{ext}_n = (a_n, b_n)\bigr) = \mathbb P(X_{n+1} = a_{n+1}\mid X_n = a_n)\,\mathbb P(Z_{n+1} = b_{n+1}).
\]
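The extended-chain construction is mechanical and easy to verify. A sketch (the parameter values are illustrative):

```python
import numpy as np
from itertools import product

p, eps = 0.3, 0.1
P = np.array([[1 - p, p], [p, 1 - p]])
pz = np.array([1 - eps, eps])            # law of the iid noise Z_n

# extended chain on pairs (X_n, Z_n), states ordered (0,0),(0,1),(1,0),(1,1)
states = list(product([0, 1], repeat=2))
Pext = np.array([[P[a, a2] * pz[b2] for (a2, b2) in states]
                 for (a, b) in states])
assert np.allclose(Pext.sum(axis=1), 1.0)     # stochastic matrix

# matches the block structure of the displayed matrix: row for (0, .)
expected_row0 = [(1 - p) * (1 - eps), (1 - p) * eps, p * (1 - eps), p * eps]
assert np.allclose(Pext[0], expected_row0)

# the output is the deterministic function phi(a, b) = a XOR b
phi = [a ^ b for (a, b) in states]
assert phi == [0, 1, 1, 0]
```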

This fact holds in greater generality.

ext Theorem 3.11. If {Yn} is a Hidden Markov Chain, then Yn = ϕ(Xn ) for some ext finite state Markov chain {Xn } and some coding function ϕ.

 Exercise 3.12. Give a proof of this fact. (Hint: What can you say about {(Xn, Yn)} as a process?)

Formally, we thus only have to study functions of Markov chains. However, sometimes it is more convenient to keep the original HMM representation of $\{Y_n\}$ in terms of $(\pi, P, \Pi)$.

3.4 Functions of Markov Chains from the Ther- modynamic point of view

The principal question we will address now is whether the output process $Y_n = \varphi(X_n)$, where $\{X_n\}$ is Markov, is regular; more specifically, whether the law $\mathbb Q$ of $\{Y_n\}$ is a g-measure or a Gibbs measure, and, if so, whether we can identify the corresponding interaction or g-function.

Example 3.13 (Blackwell-Furstenberg-Walters-Van den Berg). This exam- ple appeared independently in Probability Theory, Dynamical Systems, and Statistical Physics. Suppose {Xn}, Xn ∈ {−1, 1}, is a sequence of independent identically distributed random variables with

P(Xn = 1) = p, P(Xn = −1) = 1 − p, p 6= 1/2.

Let $Y_n = X_n\cdot X_{n+1}$ for all $n$. Then $Y_n = \varphi(X^{ext}_n)$, where $X^{ext}_n = (X_n, X_{n+1})$ and $\{X^{ext}_n\}$ is Markov with

\[
P = \begin{pmatrix}
p & 1-p & 0 & 0\\
0 & 0 & p & 1-p\\
p & 1-p & 0 & 0\\
0 & 0 & p & 1-p
\end{pmatrix},
\]
the states being ordered $(1,1), (1,-1), (-1,1), (-1,-1)$.

The output process $\{Y_n\}$ takes values in $\{-1,1\}$, and, as we will see, the measure $\mathbb Q$ has full support. It turns out that the process $\{Y_n\}$ has extremely irregular conditional probabilities.

Let us compute probability of a cylindric event

\[
[y_0^n] = \{Y_0 = y_0,\ Y_1 = y_1,\ \ldots,\ Y_n = y_n\}, \qquad y_i\in\{-1,1\}.
\]

To compute $\mathbb Q([y_0^n])$, we have to determine $\varphi^{-1}([y_0^n])$, i.e., the full preimage of $[y_0^n]$ in the $\{X_n\}$-space. It is easy to see that

\[
\varphi^{-1}([y_0^n]) = \{x_0 = 1,\ x_1 = y_0,\ x_2 = y_0y_1,\ \ldots,\ x_{n+1} = y_0y_1\cdots y_n\}\ \sqcup\ \{x_0 = -1,\ x_1 = -y_0,\ x_2 = -y_0y_1,\ \ldots,\ x_{n+1} = -y_0y_1\cdots y_n\}.
\]

Hence,

\[
\mathbb Q([y_0^n]) = \mathbb P(X_0 = 1,\ X_{k+1} = y_0\cdots y_k,\ k = 0,1,\ldots,n) + \mathbb P(X_0 = -1,\ X_{k+1} = -y_0\cdots y_k,\ k = 0,1,\ldots,n)
\]
\[
= p^{\#\text{ of }1\text{'s in }(1,\,y_0,\,y_0y_1,\ldots,\,y_0y_1\cdots y_n)}\,(1-p)^{\#\text{ of }(-1)\text{'s in }(1,\,y_0,\,y_0y_1,\ldots,\,y_0y_1\cdots y_n)}
\]
\[
+\ p^{\#\text{ of }1\text{'s in }(-1,\,-y_0,\,-y_0y_1,\ldots,\,-y_0y_1\cdots y_n)}\,(1-p)^{\#\text{ of }(-1)\text{'s in }(-1,\,-y_0,\,-y_0y_1,\ldots,\,-y_0y_1\cdots y_n)}.
\]

Let $S_n = y_0 + y_0y_1 + \cdots + y_0y_1\cdots y_n = \#\text{ of }1\text{'s} - \#\text{ of }(-1)\text{'s}$ among $(y_0, y_0y_1, \ldots, y_0\cdots y_n)$. Since we also have $\#\text{ of }1\text{'s} + \#\text{ of }(-1)\text{'s} = n+1$, we conclude that
\[
\mathbb P([1, y_0, y_0y_1, \ldots, y_0y_1\cdots y_n]) = p^{1+\frac{n+1+S_n}2}(1-p)^{\frac{n+1-S_n}2}
\]
and

\[
\mathbb P([-1, -y_0, -y_0y_1, \ldots, -y_0y_1\cdots y_n]) = p^{\frac{n+1-S_n}2}(1-p)^{1+\frac{n+1+S_n}2}.
\]

Therefore,

\[
\mathbb Q([y_0^n]) = p^{1+\frac{n+1+S_n}2}(1-p)^{\frac{n+1-S_n}2} + p^{\frac{n+1-S_n}2}(1-p)^{1+\frac{n+1+S_n}2}
= p^{\frac{n+1}2}(1-p)^{\frac{n+1}2}\Bigl[p\Bigl(\tfrac p{1-p}\Bigr)^{\frac{S_n}2} + (1-p)\Bigl(\tfrac{1-p}p\Bigr)^{\frac{S_n}2}\Bigr].
\]

Similarly,

\[
\mathbb Q([y_1^n]) = p^{\frac n2}(1-p)^{\frac n2}\Bigl[p\Bigl(\tfrac p{1-p}\Bigr)^{\frac{\tilde S_n}2} + (1-p)\Bigl(\tfrac{1-p}p\Bigr)^{\frac{\tilde S_n}2}\Bigr],
\]

where $\tilde S_n = y_1 + y_1y_2 + \cdots + y_1y_2\cdots y_n$. Since $S_n = y_0(1+\tilde S_n)$, one has

\[
\mathbb Q(y_0 = 1\mid y_1^n) = \sqrt{p(1-p)}\ \frac{p\,\lambda^{\frac{\tilde S_n+1}2} + (1-p)\,\lambda^{-\frac{\tilde S_n+1}2}}{p\,\lambda^{\frac{\tilde S_n}2} + (1-p)\,\lambda^{-\frac{\tilde S_n}2}}
= \sqrt{p(1-p)}\ \frac{p\sqrt\lambda\,\lambda^{\tilde S_n} + \frac{1-p}{\sqrt\lambda}}{p\,\lambda^{\tilde S_n} + (1-p)},
\]
where $\lambda = p/(1-p)$,

\[
=: \frac{a\lambda^{\tilde S_n} + b}{c\lambda^{\tilde S_n} + d},
\]
where
\[
\frac ac = \sqrt{p(1-p)\lambda} = p\ \ne\ \sqrt{\frac{p(1-p)}\lambda} = 1-p = \frac bd,
\]
since $p\ne 1/2$. Suppose for simplicity that $\lambda > 1$. For any $y_1^n$, one can choose a continuation $z_{n+1}^{5n}$ such that the corresponding $\tilde S_{5n}$ is positive and large. Equally well, one can choose $w_{n+1}^{5n}$ such that $\tilde S_{5n} < 0$ and $|\tilde S_{5n}|$ is large. In the first case,
\[
\mathbb Q(y_0 = 1\mid y_1^n z_{n+1}^{5n}) \simeq \frac ac = p,
\]
and in the second case,

\[
\mathbb Q(y_0 = 1\mid y_1^n w_{n+1}^{5n}) \simeq \frac bd = 1-p.
\]
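The two limiting values $p$ and $1-p$ can be checked from the exact cylinder probabilities, since every cylinder has exactly two preimages. A short sketch (the value $p = 0.7$ is an arbitrary illustrative choice):

```python
p = 0.7                      # P(X_n = 1); lambda = p/(1-p) > 1

def Q(y):
    """Exact Q(Y_0 = y[0], ..., Y_m = y[m]): sum over the two preimages,
    using x_{k+1} = x_k * y_k for the iid +-1 chain."""
    tot = 0.0
    for x0 in (1, -1):
        xs = [x0]
        for yk in y:
            xs.append(xs[-1] * yk)
        pr = 1.0
        for xi in xs:
            pr *= p if xi == 1 else 1 - p
        tot += pr
    return tot

n = 30
y_plus = (1,) * n                  # all partial products +1: S~_n = +n
y_minus = (-1,) + (1,) * (n - 1)   # all partial products -1: S~_n = -n

cond_plus = Q((1,) + y_plus) / Q(y_plus)
cond_minus = Q((1,) + y_minus) / Q(y_minus)

# the two conditionings drive Q(y0 = 1 | .) to a/c = p and b/d = 1 - p
assert abs(cond_plus - p) < 1e-6
assert abs(cond_minus - (1 - p)) < 1e-6
```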

Therefore, the conditional probabilities $\mathbb Q(y_0 = 1\mid y_1y_2\cdots)$ are everywhere discontinuous. In some sense, this is the worst and most irregular behavior possible.

Example 3.14 (Jitter in Optical Storage). Baggen et al., Entropy of a bit-shift channel. IMS LNMS Vol. 48, pp. 274–285 (2006).

Figure 3.1: Images of DVD disks. The left image shows a DVD-ROM. The track is formed by pressing the disk mechanically onto a master disk. On the right is an image of a rewritable disk. The resolution has been increased to demonstrate the irregularities in the track produced by the laser. These irregularities lead to higher probabilities of jitter errors.

Information is encoded not in the state of the disk (0/1, high/low), but in the lengths of intervals (see the figure above). When the disk is read, the transitions might be detected at a wrong position due to noise, inter-symbol interference, clock jitter, and other distortions, with jitter being the main contributing factor.

If $X$ is the length of a continuous interval of low or high marks on the disk, then, after reading, the detected length is

\[
Y = X + \omega_{left} - \omega_{right},
\]
where $\omega_{left}, \omega_{right}$ take values in $\{-1, 0, 1\}$, meaning that the transition between the "low"–"high" or "high"–"low" runs was detected one time unit too early, on time, or one unit too late, respectively. For technical reasons, the so-called run-length limited (RLL) sequences are used. By definition, a $(d,k)$-RLL sequence has at least $d$ and at most $k$ 0's between successive 1's. So a "high" mark of 4 units, followed by a "low" of 3 units, followed by a "high" of 4 units, corresponds to the RLL sequence 100010010001 written to the disk. (The CD uses a (2,10)-RLL code.) Shamai and Zehavi proposed the following model of a bit-shift channel.
The signal. Let $A = \{d,\ldots,k\}$, where $d,k\in\mathbb N$, $d < k$ and $d\ge 2$. The input space is $A^{\mathbb Z} = \{x = (x_i) : x_i\in A\}$ with the Bernoulli measure $\mathbb P = p^{\mathbb Z}$, $p = (p_d,\ldots,p_k)$.
The jitter. Let us assume that at the left endpoint of each interval the jitter produces an independent error with distribution

\[
\mathbb P_J(\omega_{left} = -1) = \mathbb P_J(\omega_{left} = 1) = \epsilon, \qquad \mathbb P_J(\omega_{left} = 0) = 1-2\epsilon, \qquad \epsilon > 0.
\]

Let $\{\omega_n\}$ be the sequence of these independent errors. If $X_n$ is the length of the $n$-th interval, then the detected length $Y_n$ is given by

Yn = Xn + ωn − ωn+1.

Since $\omega^{ext}_n = (\omega_n, \omega_{n+1})$ is Markov, $\{Y_n\}$ is a hidden Markov process, i.e., a function of a Markov process. The corresponding measure $\mathbb Q$ does not have continuous conditional probabilities. The irregularity is less severe than in the previous example, but one easily finds so-called bad configurations for $\mathbb Q$. For example,

\[
Y_0 = d-2, \quad Y_1 = \ldots = Y_n = d
\]
has effectively one preimage:

\[
\varphi^{-1}\bigl[(d-2)\,\underbrace{d\,\ldots\,d}_{n}\bigr] \subseteq \bigl[\underbrace{d\,\ldots\,d}_{n+1}\bigr]\times\bigl[-1\,\underbrace{1\,\ldots\,1}_{n}\bigr],
\]
and hence

\[
\mathbb Q(Y_0 = d-2,\ Y_1 = \ldots = Y_n = d) \le p_d^{n+1}\,\epsilon^{n+1}.
\]

Meanwhile, $[d\ldots d]$ has a large preimage, and hence a large probability: one can show that for some $C > 0$,
\[
\mathbb Q(Y_1 = \ldots = Y_n = d) \ge \frac nC\,\mathbb Q(Y_0 = d-2,\ Y_1 = \ldots = Y_n = d).
\]
Thus
\[
\mathbb Q(Y_0 = d-2\mid Y_1 = \ldots = Y_n = d) \le \frac Cn \to 0 \quad\text{as } n\to\infty,
\]
and therefore $\mathbb Q$ is not a g-measure.
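The vanishing of this conditional probability can be computed exactly by summing over the jitter variables. A sketch, under the illustrative assumptions of a uniform input law $p$ and small parameter values (not specified in the text):

```python
d, k, eps = 2, 5, 0.05
support = range(d, k + 1)
px = {a: 1.0 / (k - d + 1) for a in support}   # uniform input law (assumption)
pj = {-1: eps, 0: 1 - 2 * eps, 1: eps}          # jitter law

def Q(y):
    """Exact Q(Y_0 = y[0], ..., Y_m = y[m]); Y_t = X_t + w_t - w_{t+1},
    computed by forward summation over the jitter variables w_t."""
    alpha = {w1: sum(pj[w0] * pj[w1] * px.get(y[0] - w0 + w1, 0.0)
                     for w0 in pj)
             for w1 in pj}
    for t in range(1, len(y)):
        alpha = {w1: sum(alpha[w0] * pj[w1] * px.get(y[t] - w0 + w1, 0.0)
                         for w0 in alpha)
                 for w1 in pj}
    return sum(alpha.values())

conds = []
for n in (5, 8, 11, 14):
    bad = (d - 2,) + (d,) * n
    conds.append(Q(bad) / Q((d,) * n))          # Q(Y0 = d-2 | Y1..n = d)

# the conditional probability of the "bad" symbol vanishes as n grows
assert conds == sorted(conds, reverse=True)
assert conds[-1] < 1e-6
```

For this choice of parameters the decay is in fact much faster than the $C/n$ bound stated above.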

The common feature of the previous two examples is that the output process $Y = \{Y_n\}$ is a function of a Markov chain $\{X_n\}$ whose transition probability matrix $P$ has some zero entries. The following result shows that this is the principal cause of the irregular behavior of conditional probabilities.

Theorem 3.15. Suppose $\{X_n\}$, $X_n\in A$, is a stationary Markov chain with a strictly positive transition probability matrix

\[
P = (p_{i,j})_{i,j\in A} > 0.
\]

Suppose also that $\varphi: A\to B$ is a surjective map, and

Yn = φ(Xn) for every n.

(i) Then the sequence of conditional probabilities

Q(y0|y1:n) = Q(Y0 = y0|Y1:n = y1:n)

Z+ converges uniformly (in y = y0:∞ ∈ B ) as n → ∞ to the limit denoted by

g(y) = Q(y0|y1:∞).

Hence, Q is a g-measure. (ii) Moreover, there exist C > 0 and θ ∈ (0, 1) such that

\[
\beta_n := \sup_{y_{0:n-1}}\ \sup_{u_{n:\infty},\,v_{n:\infty}}\ \bigl|g(y_{0:n-1}u_{n:\infty}) - g(y_{0:n-1}v_{n:\infty})\bigr| \le C\theta^n \qquad(3.4.1)
\]

for all n ∈ N. (iii) Therefore, Q (viewed as a measure on BZ) is Gibbs in the DLR-sense.

This result has been obtained in the literature a number of times using different methods; depending on the method employed, the resulting values of

\[
\theta_* = \limsup_n\,(\beta_n)^{1/n}
\]
vary greatly. The sufficient condition $P > 0$ is definitely far from being necessary. For example, Han and Marcus extended the result of Theorem 3.15 to the case when every column of $P$ is either zero or strictly positive. However, the problem of finding necessary and sufficient conditions for functions of Markov chains to be "regular" is still open. A rather curious observation is that in all known cases, if a function of a Markov process $Y_n = \varphi(X_n)$ is regular as a process, e.g., $\mathbb Q$ is a g-measure, then $\mathbb Q$ is "highly regular", e.g., has exponential decay of $\beta_n$.

Chapter 4

Hidden Gibbs Processes

Last modified on April 24, 2015

4.1 Functions of Gibbs Processes

4.2 Yayama

Definition 4.1 (Bowen). Let $\Omega = A^{\mathbb Z}$. A Borel probability measure $\mathbb P$ on $\Omega$ is called a Gibbs measure for $\phi:\Omega\to\mathbb R$ if there exist $C > 0$ and $P\in\mathbb R$ such that

\[
\frac 1C < \frac{\mathbb P([x_0x_1\cdots x_{n-1}])}{\exp\bigl((S_n\phi)(x) - nP\bigr)} < C \qquad(4.2.1)
\]
for every $x\in\Omega$ and $n\in\mathbb N$, where

\[
(S_n\phi)(x) = \sum_{k=0}^{n-1}\phi(T^k x).
\]

Note that for every $m,n\ge 1$ and all $x$,

\[
(S_{n+m}\phi)(x) = (S_n\phi)(x) + (S_m\phi)(T^n x).
\]

The sequence of functions φn(x) := (Snφ)(x), n ≥ 1, satisfies

\[
\phi_{n+m}(x) = \phi_n(x) + \phi_m(T^n x).
\]

∞ Definition 4.2. A sequence of continuous functions {φn}n=1 is called


• almost additive if there is a constant C ≥ 0 such that for every n, m ∈ N and for every x

\[
\bigl|\phi_{n+m}(x) - \phi_n(x) - \phi_m(T^n x)\bigr| \le C;
\]

• subadditive if
\[
\phi_{n+m}(x) \le \phi_n(x) + \phi_m(T^n x);
\]
• asymptotically subadditive if for any $\epsilon > 0$ there exists a subadditive sequence $\{\psi_n\}_{n=1}^\infty$ on $X$ such that

\[
\limsup_{n\to\infty}\ \frac 1n\sup_{x\in X}\bigl|\phi_n(x) - \psi_n(x)\bigr| \le \epsilon.
\]
Definition 4.3. Suppose $\{\phi_n\}_{n=1}^\infty$ is an asymptotically subadditive sequence of continuous functions. A Borel probability measure $\mathbb P$ is called a Gibbs measure for $\{\phi_n\}$ if there exist $C > 0$ and $P\in\mathbb R$ such that

\[
\frac 1C < \frac{\mathbb P([x_0x_1\cdots x_{n-1}])}{\exp\bigl(\phi_n(x) - nP\bigr)} < C \qquad(4.2.2)
\]
for every $x$ and all $n\in\mathbb N$.

The sequence $\{\phi_n\}$ is said to have bounded variation if there exists a constant $M > 0$ such that
\[
\sup_n\ \sup\bigl\{|\phi_n(x) - \phi_n(\bar x)| :\ x_0^{n-1} = \bar x_0^{n-1}\bigr\} \le M.
\]

Theorem 4.4. Suppose $\phi = \{\phi_n\}$ is an almost additive sequence with bounded variation. Then there exists a unique invariant Gibbs measure $\mathbb P_\phi$ for $\phi$; $\mathbb P_\phi$ is mixing. Moreover, $\mathbb P_\phi$ is also the unique equilibrium state for $\phi$:

\[
h(\mathbb P_\phi) + \lim_{n\to\infty}\frac 1n\int\phi_n\, d\mathbb P_\phi = \sup_{\mathbb Q}\Bigl\{h(\mathbb Q) + \lim_{n\to\infty}\frac 1n\int\phi_n\, d\mathbb Q\Bigr\}.
\]
Theorem 4.5 (Yuki Yayama). Suppose $\mathbb P$ is the unique invariant Gibbs measure on $A^{\mathbb Z}$ for the almost additive sequence $\phi = \{\phi_n\}$ with bounded variation, and $\mathbb Q = \pi_*\mathbb P = \mathbb P\circ\pi^{-1}$, where $\pi: A^{\mathbb Z}\to B^{\mathbb Z}$ is the single-site factor map. Then $\mathbb Q$ is the unique invariant Gibbs measure for an asymptotically subadditive sequence $\psi = \{\psi_n\}$ with bounded variation, where

\[
\psi_n(y) = \log\sum_{x_0^{n-1}\in\pi^{-1}(y_0^{n-1})}\ \sup_{x^*\in[x_0^{n-1}]}\exp\bigl(\phi_n(x^*)\bigr).
\]
Remark. Almost Additive $\subseteq$ Asymptotically Subadditive.

4.3 Review: renormalization of g-measures

Theorem 4.6. Suppose $\mathbb P$ is a g-measure with $\beta_n = \mathrm{var}_n(g)\to 0$ sufficiently fast. Then $\mathbb Q = \mathbb P\circ\pi^{-1}$ is a $\tilde g$-measure, i.e., $\mathbb Q(y_0\mid y_1^\infty) = \tilde g(y)$ and

\[
\tilde\beta_n = \mathrm{var}_n(\tilde g) \to 0.
\]

• Denker & Gordin (2000): $\beta_n = O(e^{-\alpha n})\ \Rightarrow\ \tilde\beta_n = O(e^{-\tilde\alpha n})$
• Chazottes & Ugalde (2009): $\sum_n n^2\beta_n < \infty\ \Rightarrow\ \sum_n\tilde\beta_n < \infty$
• Kempton & Pollicott (2009): $\sum_n n\beta_n < \infty\ \Rightarrow\ \sum_n\tilde\beta_n < \infty$
• Redig & Wang (2010): $\sum_n\beta_n < \infty\ \Rightarrow\ \tilde\beta_n\to 0$
• V. (2011): $\sum_n\beta_n < \infty\ \Rightarrow\ \tilde\beta_n\to 0$

If $\beta_n\approx e^{-\alpha n}$, then $\tilde\beta_n\approx e^{-\alpha\sqrt n}$ [CU, KP], $\approx e^{-\alpha n}$ [DG, RW]. If $\beta_n\approx n^{-\alpha}$, then $\tilde\beta_n\approx n^{-\alpha+2}$ [CU, RW], and $\approx n^{-\alpha+1}$ [KP].
OPEN PROBLEM: Identify a class of (generalized) Gibbs measures closed under renormalization transformations – single-site/one-block factors.

4.4 Cluster Expansions

\[
\mathbb Q(y_0^{n-1}) = \sum_{x_0^{n-1}\in\pi^{-1}(y_0^{n-1})}\mathbb P(x_0^{n-1}).
\]

\[
\log\mathbb Q(y_0^{n-1}) = \log\Bigl(\sum_{x_0^{n-1}\in\pi^{-1}(y_0^{n-1})}\mathbb P(x_0^{n-1})\Bigr).
\]

Example: binary symmetric channel with parameter $\epsilon$:
\[
\log\mathbb Q(y_0^n) = \log\Bigl(\sum_{x_0^n}\mathbb P(x_0^n)\,(1-\epsilon)^{|\{i:\ y_i = x_i\}|}\,\epsilon^{|\{i:\ y_i\ne x_i\}|}\Bigr).
\]

Emission probabilities satisfy

\[
\pi_{x_i}(y_i) = \mathrm{Prob}(Y_i = y_i\mid X_i = x_i) = (1-\epsilon)\,\mathbb I[x_i = y_i] + \epsilon\bigl(1 - \mathbb I[x_i = y_i]\bigr) = (1-\epsilon)\bigl(q + (1-q)\,\mathbb I[x_i = y_i]\bigr),
\]
where $q = \epsilon/(1-\epsilon)$.

Thus
\[
\mathbb Q(y_0^n) = (1-\epsilon)^{n+1}\sum_{x_0^n}\mathbb P(x_0^n)\prod_{i=0}^n\bigl(q + (1-q)\,\mathbb I[x_i = y_i]\bigr)
= (1-\epsilon)^{n+1}\sum_{x_0^n}\mathbb P(x_0^n)\sum_{A\subseteq[0,n]}\prod_{i\in A}(1-q)\,\mathbb I[x_i = y_i]\prod_{i\notin A} q
\]
\[
= (1-\epsilon)^{n+1}\sum_{A\subseteq[0,n]}(1-q)^{|A|}\,q^{n+1-|A|}\,\mathbb P(x_0^n :\ x_i = y_i\ \forall i\in A),
\]
where we used

\[
\sum_{x_0^n}\mathbb P(x_0^n)\prod_{i\in A}\mathbb I[x_i = y_i] = \mathbb P(x_0^n :\ x_i = y_i\ \forall i\in A).
\]

\[
\mathbb Q(y_0^n) = (1-\epsilon)^{n+1}\sum_{A\subseteq[0,n]}(1-q)^{|A|}\,q^{n+1-|A|}\,\mathbb P(x_0^n :\ x_i = y_i\ \forall i\in A);
\]
use the substitution $B = A^c = [0,n]\setminus A$:

\[
= (1-\epsilon)^{n+1}\sum_{B\subseteq[0,n]}(1-q)^{n+1-|B|}\,q^{|B|}\,\mathbb P(x_0^n :\ x_i = y_i\ \forall i\in B^c)
= \bigl[(1-\epsilon)(1-q)\bigr]^{n+1}\sum_{B\subseteq[0,n]}\Bigl(\frac q{1-q}\Bigr)^{|B|}\mathbb P(x_0^n :\ x_i = y_i\ \forall i\in B^c)
\]
\[
= (1-2\epsilon)^{n+1}\,\mathbb P(y_0^n)\sum_{B\subseteq[0,n]}\Bigl(\frac q{1-q}\Bigr)^{|B|}\frac{\mathbb P(y_{B^c})}{\mathbb P(y_0^n)}
= (1-2\epsilon)^{n+1}\,\mathbb P(y_0^n)\sum_{B\subseteq[0,n]}\Bigl(\frac q{1-q}\Bigr)^{|B|}\frac 1{\mathbb P(y_B\mid y_{B^c})},
\]
since $(1-\epsilon)(1-q) = 1-2\epsilon$ and $\mathbb P(y_{B^c})/\mathbb P(y_0^n) = 1/\mathbb P(y_B\mid y_{B^c})$. Therefore,

\[
\mathbb Q(y_0^n) = (1-2\epsilon)^{n+1}\,\mathbb P(y_0^n)\Bigl(1 + \sum_{\emptyset\ne B\subseteq[0,n]}\Bigl(\frac q{1-q}\Bigr)^{|B|}\frac 1{\mathbb P(y_B\mid y_{B^c})}\Bigr).
\]

Suppose $P$ is MARKOV and $B = I_1 \sqcup \dots \sqcup I_k$ is the decomposition of $B$ into maximal disjoint subintervals. Then
$$\Big(\frac{q}{1-q}\Big)^{|B|} \frac{1}{P(y_B \mid y_{B^c})} = \prod_{j=1}^k \Big(\frac{q}{1-q}\Big)^{|I_j|} \frac{1}{P\big(y_{I_j} \mid y_{\partial I_j \cap [0,n]}\big)}.$$

$$Q(y_0^n) = (1-2\varepsilon)^{n+1} P(y_0^n) \Big(1 + \sum_{I_1 \sqcup \dots \sqcup I_k \subseteq [0,n]} z_{I_1} \dots z_{I_k}\Big).$$

Cluster expansions. Object: a partition function given as
$$\Xi_{\mathcal P}(z) = 1 + \sum_{k \ge 1} \frac{1}{k!} \sum_{(\gamma_1,\dots,\gamma_k) \in \mathcal P^k} \phi(\gamma_1,\dots,\gamma_k)\, z_{\gamma_1} \dots z_{\gamma_k},$$
where $\mathcal P$ is a set of polymers,
$$z = \{z_\gamma \in \mathbb C \mid \gamma \in \mathcal P\}$$
are the fugacities or activities, $\phi(\gamma_1) = 1$, $\phi(\gamma_1,\dots,\gamma_k) = \prod_{i \neq j} \mathbb{I}[\gamma_i \sim \gamma_j]$, and $\sim$ is a relation on $\mathcal P$.

4.5 Cluster expansion

Theorem. If
$$\Xi_{\mathcal P}(z) = 1 + \sum_{k \ge 1} \frac{1}{k!} \sum_{(\gamma_1,\dots,\gamma_k) \in \mathcal P^k} \phi(\gamma_1,\dots,\gamma_k)\, z_{\gamma_1} \dots z_{\gamma_k},$$
then
$$\log \Xi_{\mathcal P}(z) = \sum_{k \ge 1} \frac{1}{k!} \sum_{(\gamma_1,\dots,\gamma_k) \in \mathcal P^k} \phi^T(\gamma_1,\dots,\gamma_k)\, z_{\gamma_1} \dots z_{\gamma_k},$$
where $\phi^T(\gamma_1,\dots,\gamma_k)$ are the truncated functions of order $k$. Convergence takes place on the polydisc

$$D = \{z : |z_\gamma| \le \rho_\gamma,\ \gamma \in \mathcal P\}.$$

Truncated functions $\phi^T(\gamma_1, \dots, \gamma_k)$. For $k = 1$, $\phi^T(\gamma_1) = 1$. Suppose $k > 1$. Define the graph

$$G = G_{\{\gamma_1,\dots,\gamma_k\}} = (V(G), E(G)), \quad V(G) = \{1,\dots,k\}, \quad E(G) = \big\{[i,j] : \gamma_i \nsim \gamma_j\big\}.$$

If $G_{\{\gamma_1,\dots,\gamma_k\}}$ is disconnected, then $\phi^T(\gamma_1,\dots,\gamma_k) = 0$.

If $G_{\{\gamma_1,\dots,\gamma_k\}}$ is connected,

$$\phi^T(\gamma_1,\dots,\gamma_k) = \sum_{\substack{G' \subseteq G_{\{\gamma_1,\dots,\gamma_k\}} \\ G' \text{ conn. spann.}}} (-1)^{\#E(G')},$$
where $G'$ ranges over all connected spanning subgraphs of $G$, and $E(G')$ is the edge set of $G'$.

Definition 4.7. If γ1, γ2 are two intervals in Z+, we say that γ1 and γ2 are compatible (denoted by γ1 ∼ γ2), if γ1 ∪ γ2 is not an interval.

Lemma 4.8.

$$1 + \sum_{\emptyset \neq B \subseteq [0,n]} \Big(\frac{q}{1-q}\Big)^{|B|} \frac{1}{P(y_B \mid y_{B^c})} = 1 + \sum_{k=1}^{\infty} \frac{1}{k!} \sum_{(\gamma_1,\dots,\gamma_k) \in \mathcal P_{[0,n]}^k} \phi(\gamma_1,\dots,\gamma_k)\, z_{\gamma_1}^{[0,n]} \dots z_{\gamma_k}^{[0,n]},$$

where the set of polymers $\mathcal P_{[0,n]}$ is the collection of subintervals of $[0,n]$, $\phi(\gamma_1) = 1$, $\phi(\gamma_1,\dots,\gamma_k) = \prod_{i \neq j} \mathbb{I}[\gamma_i \sim \gamma_j]$, and the fugacities are given by

$$z_{[a,b]}^{[0,n]} = \Big(\frac{\varepsilon}{1-2\varepsilon}\Big)^{b-a+1} \frac{1}{P\big(y_{[a,b]} \mid y_{\{a-1,\,b+1\} \cap [0,n]}\big)}.$$

Corollary. For all $y_0^n$, one has

$$\begin{aligned} \log Q(y_0 \mid y_{[1,n]}) &= \log p_{y_0,y_1} + \log(1-2\varepsilon) \\ &\quad + \sum_{k=1}^\infty \frac{1}{k!} \Bigg[ \sum_{\substack{(\gamma_1,\dots,\gamma_k) \in \mathcal P_{[0,n]}^k \\ \cup_i \gamma_i \cap \{0,1\} \neq \emptyset}} \phi^T(\gamma_1,\dots,\gamma_k)\, z_{\gamma_1}^{[0,n]} \dots z_{\gamma_k}^{[0,n]} - \sum_{\substack{(\gamma_1,\dots,\gamma_k) \in \mathcal P_{[1,n]}^k \\ \cup_i \gamma_i \cap \{1\} \neq \emptyset}} \phi^T(\gamma_1,\dots,\gamma_k)\, z_{\gamma_1}^{[1,n]} \dots z_{\gamma_k}^{[1,n]} \Bigg] \\ &= \log p_{y_0,y_1} + \log(1-2\varepsilon) + \sum_{m=1}^\infty \Big(\frac{\varepsilon}{1-2\varepsilon}\Big)^m \sum_{k=1}^\infty C_{n,m,k}(y_{[0,n]}). \end{aligned}$$

Observation 1: Cn,m,k = 0 for k > m.

Observation 2: If $m < n$, then $C_{n,m,k}$ depends only on $y_{[0,m+1]}$.


$$\begin{aligned} C_{2,1,1}(y_{[0,2]}) &= \frac{1}{P(y_0 \mid y_1)} + \frac{1}{P(y_1 \mid y_0, y_2)} - \frac{1}{P(y_1 \mid y_2)}, \\ C_{3,2,1}(y_{[0,3]}) &= \frac{1}{P(y_0, y_1 \mid y_2)} + \frac{1}{P(y_1, y_2 \mid y_0, y_3)} - \frac{1}{P(y_1, y_2 \mid y_3)}, \\ C_{3,2,2}(y_{[0,3]}) &= -\frac{1}{P(y_0 \mid y_1)^2} - \frac{1}{P(y_0 \mid y_1)\,P(y_1 \mid y_0, y_2)} - \frac{1}{P(y_1 \mid y_0, y_2)^2} \\ &\quad + \frac{1}{P(y_1 \mid y_2)^2} - \frac{1}{P(y_1 \mid y_0, y_2)\,P(y_2 \mid y_1, y_3)} + \frac{1}{P(y_1 \mid y_2)\,P(y_2 \mid y_1, y_3)}. \end{aligned}$$

Theorem 4.9 (Fernandez-V.). (i) The expansions

$$\log Q(y_0 \mid y_1^n) = \sum_{k=0}^\infty f_k^{(n)}(y_0^n)\, \varepsilon^k \tag{4.5.1}$$

converge absolutely and uniformly on the complex disk $|\varepsilon| \le r$ with
$$\frac{r}{1-2r} \le \frac{p}{4}. \tag{4.5.2}$$

(ii) Uniformly on this disc

$$\lim_{n\to\infty} \log Q(y_0 \mid y_1^n) = \log g(y_0^\infty) = \sum_{k=0}^\infty f_k(y_0^\infty)\, \varepsilon^k. \tag{4.5.3}$$

(iii) For every k and all n ≥ k + 1

$$f_k^{(n)}(y_0^n) = f_k(y_0^{k+1}). \tag{4.5.4}$$

Chapter 5

Hidden Gibbs Fields

Part 2: General Theory of Hidden Gibbs Models. The observable process is a function, either deterministic or random, of the underlying "signal" process, governed by a Gibbs distribution.

Random processes d = 1

{Xn}n∈Z → {Yn}n∈Z

Random fields $d > 1$:
$$\{X_n\}_{n\in\mathbb Z^d} \to \{Y_n\}_{n\in\mathbb Z^d}$$
Many natural questions:

• Is the image of a Gibbs measure again Gibbs? ($\Leftrightarrow$ Is $\{Y_n\}$ Gibbs?)

• What are the probabilistic properties of images of Gibbs measures?

• If the image measure is indeed Gibbs, can we determine the corresponding interaction?

• Given the output {Yn}, what could be said about {Xn}?

• Given {Yn}, either Gibbs or not, can we efficiently approximate it by a Gibbs process?

Dimension $d = 1$: It has been known for a long time that functions of a Markov (hence, also of a Gibbs) process can be "irregular":

Yn = XnXn+1, Xn ∼ Ber (p) , Xn ∈ {−1, 1}.


Here you can "attribute" the singularity to the fact that $X_n^{\mathrm{ext}} = (X_n, X_{n+1})$ is not fully supported. Functions of fully supported Markov chains ($\Leftrightarrow P > 0$) are Gibbs.

In dimension $d > 1$ (the realm of Statistical Physics), one typically works with fully supported Gibbs states. It came as a surprise that functions of simple fully supported Markov random fields (like the Ising model) can be non-Gibbsian. The history is well documented: see the papers (and books) by van Enter, Fernandez, Sokal, Le Ny, Bricmont, Kupiainen, Lefevere, . . .

5.1 Renormalization Transformations in Statisti- cal Mechanics

Textbook examples of RG transformations:

• Random field $\{\sigma_n\}_{n\in\mathbb Z^d}$, lattice $L = \mathbb Z^d$.

• Cover the lattice $\mathbb Z^d$ by disjoint (non-overlapping) boxes $\{B_x\}_{x\in\tilde L}$, $\tilde L \subseteq L$.

For every box Bx, let σ˜x = F({σn|n ∈ Bx}) random or deterministic.

Examples:

• Decimation.

σ˜n = σ2n.

• Majority rule: $\sigma_n \in \{\pm 1\}$, $\tilde\sigma_x$ = majority of the $\sigma_n$'s, $n \in B_x$.

• Kadanoff’s copying with noise

$$P(\tilde\sigma_x = 1 \mid \sigma_n, n \in B_x) = \frac{1}{Z} \exp\Big(\beta \sum_{n\in B_x} \sigma_n\Big), \qquad P(\tilde\sigma_x = -1 \mid \sigma_n, n \in B_x) = \frac{1}{Z} \exp\Big(-\beta \sum_{n\in B_x} \sigma_n\Big).$$
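The deterministic examples above are easy to state in code. The sketch below (not from the text; the $6\times 6$ configuration and the tie-breaking toward $+1$ in the majority rule are illustrative assumptions) applies decimation and the $2\times 2$ block majority rule:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = rng.choice([-1, 1], size=(6, 6))   # a sample spin configuration (illustrative)

# decimation with spacing 2: sigma~_n = sigma_{2n}
decimated = sigma[::2, ::2]
assert decimated.shape == (3, 3)
assert decimated[1, 2] == sigma[2, 4]

# majority rule on disjoint 2x2 blocks B_x (ties broken toward +1 -- an assumption)
block_sums = sigma.reshape(3, 2, 3, 2).sum(axis=(1, 3))
majority = np.where(block_sums >= 0, 1, -1)
assert majority.shape == (3, 3)
assert majority[0, 0] == (1 if sigma[:2, :2].sum() >= 0 else -1)
```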

• Spin flip dynamics (Glauber spin flips)

– Each $n \in \mathbb Z^d$ has an independent Poissonian clock with intensity $\lambda$.

– Every time the clock rings, the spin is flipped: $\sigma_n \to \bar\sigma_n = -\sigma_n$.

$\Longrightarrow$ a continuous-time Markov process on $\{\pm 1\}^{\mathbb Z^d}$.

$$P_0 \longrightarrow P_t = S^t P_0.$$

For fixed $t > 0$, $P_0 \xrightarrow{\ S^t\ } P_t$, where
$$(S^t\sigma)_n = \begin{cases} \sigma_n, & \text{w.p. } 1 - \varepsilon_t, \\ \bar\sigma_n, & \text{w.p. } \varepsilon_t. \end{cases}$$

Exercise: Show that $\varepsilon_t = \frac{1}{2}\big(1 - e^{-2\lambda t}\big)$ for all $t \ge 0$.

5.2 Examples of pathologies under renormaliza- tion

$$L = \mathbb Z^d = \bigcup_{x\in\tilde L} B_x, \qquad \sigma \in A^{\mathbb Z^d} = A^L \longrightarrow \tilde\sigma \in B^{\tilde L}.$$

Physical point of view:
$$\exp\big(-\tilde\beta \tilde H(\tilde\sigma)\big) = \sum_{\sigma \mapsto \tilde\sigma} \exp\big(-\beta H(\sigma)\big).$$

E.g.

$$H = \sum_i J_i \sigma_i + \sum_{i,j} J_{ij} \sigma_i\sigma_j + \sum_{i,j,k} J_{ijk} \sigma_i\sigma_j\sigma_k, \qquad \tilde H = \sum_i \tilde J_i \tilde\sigma_i + \sum_{i,j} \tilde J_{ij} \tilde\sigma_i\tilde\sigma_j + \cdots.$$

Try to define: $J = (J_i, J_{ij}, \dots) \to \tilde J = (\tilde J_i, \tilde J_{ij}, \dots)$.

• It was implicitly assumed that renormalization is always successful: $\exists R$ : Potentials $\longrightarrow$ Potentials.

• First examples where some "difficulties"/pathologies were observed, appeared in early 1980’s.

• Soon it was explained that the source of problems is the lack of quasi-locality.

As we have seen, by Kozlov-Sullivan’s Theorem

$P$ is Gibbs $\iff$ $P$ is consistent with some specification $\gamma = \{\gamma_\Lambda(\cdot\mid\cdot)\}_{\Lambda\Subset L}$ which is

• uniformly non-null: $\forall \Lambda \Subset L$, $\exists\, 0 < \alpha_\Lambda \le \beta_\Lambda < 1$ such that
$$0 < \alpha_\Lambda \le \gamma_\Lambda(\sigma_\Lambda \mid \sigma_{\Lambda^c}) \le \beta_\Lambda < 1;$$

• quasilocal (regular): $\forall \Lambda \Subset L$,
$$\sup_{\sigma,\eta,\xi} \big|\gamma_\Lambda(\sigma_\Lambda \mid \sigma_{W\setminus\Lambda}\eta_{W^c}) - \gamma_\Lambda(\sigma_\Lambda \mid \sigma_{W\setminus\Lambda}\xi_{W^c})\big| \longrightarrow 0 \quad \text{as } W \nearrow L.$$

In principle, both properties can fail under renormalization. However, suppose $P \in \mathcal G(\gamma)$ is a measure on $A^{\mathbb Z^d}$. A function $\varphi : A \to B$ (onto) extends to $A^{\mathbb Z^d} \to B^{\mathbb Z^d}$. Let $Q = \varphi_* P = P \circ \varphi^{-1}$. Then $\forall \Lambda \Subset \mathbb Z^d$:
$$Q(\tilde\sigma_\Lambda) = \sum_{\sigma_\Lambda \in \varphi^{-1}(\tilde\sigma_\Lambda)} P(\sigma_\Lambda),$$
and
$$Q(\tilde\sigma_0 \mid \tilde\sigma_{\Lambda\setminus\{0\}}) = \frac{\sum_{\sigma_\Lambda \in \varphi^{-1}(\tilde\sigma_\Lambda)} P(\sigma_\Lambda)}{\sum_{\sigma_{\Lambda\setminus\{0\}} \in \varphi^{-1}(\tilde\sigma_{\Lambda\setminus\{0\}})} P(\sigma_{\Lambda\setminus\{0\}})} = \sum_{\sigma_\Lambda \in \varphi^{-1}(\tilde\sigma_\Lambda)} P(\sigma_0 \mid \sigma_{\Lambda\setminus\{0\}})\, \frac{P(\sigma_{\Lambda\setminus\{0\}})}{\sum_{\bar\sigma_{\Lambda\setminus\{0\}} \in \varphi^{-1}(\tilde\sigma_{\Lambda\setminus\{0\}})} P(\bar\sigma_{\Lambda\setminus\{0\}})},$$
that is,
$$Q(\tilde\sigma_0 \mid \tilde\sigma_{\Lambda\setminus\{0\}}) = \sum_{\sigma_\Lambda \in \varphi^{-1}(\tilde\sigma_\Lambda)} P(\sigma_0 \mid \sigma_{\Lambda\setminus\{0\}})\, W(\sigma_{\Lambda\setminus\{0\}}).$$

• the $P(\sigma_0 \mid \sigma_{\Lambda\setminus\{0\}})$ converge to limits in $[\alpha_{\{0\}}, \beta_{\{0\}}]$,

• {W(σΛ\{0})} are probability weights.

=⇒ Q inherits uniform non-nullness from P.

Conclusion: Thus it is only possible that quasi-locality fails: Q is not consistent with any quasi-local specification. Absence of quasi-locality

Definition: A measure $P$ on $\Omega$ is not quasi-local at $\omega \in \Omega$ if there exist $\Lambda \Subset L$ and a quasi-local $f : \Omega \to \mathbb R$ such that no "version"/realization of $\mathbb E_P(f \mid \mathcal F_{\Lambda^c})(\cdot)$ is quasi-local at $\omega$. (For a finite spin space, $\Omega = A^L$, quasi-locality is equivalent to continuity in the product topology.)

Problem: EP( f |FΛc )(·) are defined P-a.s. It seems impossible to check "all" versions.

Remark: Continuity means Feller property: γΛ( f ) ∈ C(Ω) for all f ∈ C(Ω).

Theorem 5.1 (Existence of a bad configuration). Suppose $P$ is uniformly non-null. A point $\omega \in \Omega$ is called a bad point if $\exists \delta > 0$ such that $\forall \Lambda \Subset \mathbb Z^d$ $\exists V \Subset \mathbb Z^d$, $\Lambda \subset V$, and $\eta, \xi \in \Omega$ such that

$$\big| P(\omega_0 \mid \omega_{\Lambda\setminus\{0\}}\eta_{V\setminus\Lambda}\sigma^+_{V^c}) - P(\omega_0 \mid \omega_{\Lambda\setminus\{0\}}\xi_{V\setminus\Lambda}\sigma^-_{V^c}) \big| \ge \delta$$
for all $\sigma^+, \sigma^- \in \Omega$. If $\omega \in \Omega$ is a bad point, then $P$ is not quasi-local.

Remark: Existence of a bad configuration for $P$ is "stronger" than non-quasilocality of $P$. But in all known cases, non-quasilocality of $P$ has been established by exhibiting a bad $\omega$. How do we identify bad points?

Hidden phase transition scenario (van Enter, Fernandez, Sokal): "Loss of Gibbsianity occurs when there is a hidden phase transition in the original system conditioned on the image spins."

• For an image configuration $\tilde\sigma \in \tilde\Omega$, let $X_{\tilde\sigma} = \varphi^{-1}(\tilde\sigma) \subseteq \Omega$ be the fibre over $\tilde\sigma$.

• If $|\mathcal G_{X_{\tilde\sigma}}(\phi)| = 1$ for all $\tilde\sigma \in \tilde\Omega$, then $\nu = \varphi_*\mu$ is Gibbs.

• If $|\mathcal G_{X_{\tilde\sigma}}(\phi)| > 1$ for some $\tilde\sigma$, then $\nu = \varphi_*\mu$ is not Gibbs.

• Hypothesis is true in all known cases!

• Phase transitions: e.g. Ising-type models. $\exists \beta_c = \beta_c(\phi)$ such that $\forall \beta < \beta_c$, $|\mathcal G(\beta\phi)| = 1$, and $\forall \beta > \beta_c$, $|\mathcal G(\beta\phi)| > 1$.

• Take $\beta < \beta_c$ $\Longrightarrow$ there is a unique phase.

Conditioning on the image spins may lower the temperature beyond the critical value: $|\mathcal G_{X_{\tilde\sigma}}(\beta\phi)| \ge 2$.

Examples

• Decimation of the Ising Model

$$H = -J \sum_{i\sim j} \sigma_i\sigma_j$$

Theorem 5.2. Let $P$ be any Gibbs measure for the 2D Ising model with $J > \frac{1}{2}\cosh^{-1}(1+\sqrt 2) \simeq 0.769\ldots$ and zero magnetic field. Then $Q = (T_2)_* P$ is not consistent with any quasi-local specification.

Kadanoff’s transformation:

$$\{\sigma_n\}_{n\in\mathbb Z^d} \to \{\tilde\sigma_n\}_{n\in\mathbb Z^d}, \qquad \Pi(\tilde\sigma_n = 1 \mid \sigma_n) = \frac{e^{p\sigma_n}}{2\cosh(p\sigma_n)}, \quad \Pi(\tilde\sigma_n = -1 \mid \sigma_n) = \frac{e^{-p\sigma_n}}{2\cosh(p\sigma_n)}.$$

Theorem 5.3. Suppose $d > 1$ and $0 < p < \infty$. Then $\exists J_0 = J_0(d, p)$ such that for all $J > J_0$ and any Gibbs measure $P$ for $H = -J\sum_{i\sim j}\sigma_i\sigma_j$, the renormalized measure $Q$ is not Gibbs.

Hidden Markov: the single-site Kadanoff transformation is a binary symmetric channel with
$$\varepsilon = \frac{1}{1 + e^{2p}}.$$

Repeated application of Kadanoff’s transformations:

$$P[\tilde\sigma_n = \sigma_n] = 1 - \varepsilon, \qquad P[\tilde\sigma_n \neq \sigma_n] = \varepsilon.$$

Potts Model:
$$\Omega = \{1,\dots,q\}^{\mathbb Z^d}, \quad A = \{1,\dots,q\}, \quad q \text{ spins/colors},$$
$$\gamma_\Lambda(\sigma_\Lambda \mid \sigma_{\Lambda^c}) = \frac{1}{z} \exp\Big(2\beta \sum_{i\in\Lambda,\ j\sim i} \mathbb{I}[\sigma_i = \sigma_j]\Big).$$

Phase Diagram (Aizenman-Chayes-Chayes-Newman)

There exists βc = βc(q, d) ∈ (0, +∞):

• ∀β < βc (High Temperature) ∃! unique Gibbs state

• ∀β > βc (Low Temperature) ∃ multiple Gibbs states.

Fuzzy Potts Model (Maes, Vande Velde; Häggström):

$$\varphi : \{1,\dots,q\} \to \{1,\dots,s\}, \quad s < q,$$
$$\tilde\sigma_n = \varphi(\sigma_n) = \begin{cases} 1, & \sigma_n \in \{1,\dots,r_1\}, \\ 2, & \sigma_n \in \{r_1+1,\dots,r_1+r_2\}, \\ \ \vdots & \\ s, & \sigma_n \in \{r_1+\cdots+r_{s-1}+1,\dots,r_1+\cdots+r_s = q\}. \end{cases}$$

Agreement: $r_1 = \min_i r_i$.

Theorem 5.4. $Q = \varphi_* P = P \circ \varphi^{-1}$ (with $P \in \mathcal G(\beta\phi)$) is

• Gibbs, if $\beta < \beta_c(r_1, d)$;

• not Gibbs, if $\beta > \tilde\beta = \frac{1}{2}\log\dfrac{1+(r_1-1)\,p_c(d)}{1-p_c(d)}$.

5.3 General Properties of Renormalised Gibbs Fields

First Fundamental Theorem

Theorem 5.5. If P1, P2 ∈ GS(φ) (Gibbs for the same φ), then Q1 = T∗P1, Q2 = T∗P2 are either

• Both non-Gibbs

• or, Gibbs for the same interaction ψ.

⇒ Single-valuedness of RG transformation.

• Acts on interactions / Hamiltonians, not “measures”.

• Excludes the possibility $Q_1 \in \mathcal G(\psi_1)$, $Q_2 \in \mathcal G(\psi_2)$ with $\mathcal G(\psi_1) \neq \mathcal G(\psi_2)$.

RG Relation:

RT = {(φ, ψ) : ∃µ ∈ GS(φ), ∃ν ∈ GS(ψ) : ν = T∗µ}

Second Fundamental Theorem:

If $(\phi_0, \psi_0) \in \mathcal R_T$ and $(\phi_1, \psi_1) \in \mathcal R_T$, then

$$|||\psi_0 - \psi_1||| \le K\,|||\phi_0 - \phi_1|||,$$
where $|||\cdot|||$ is the norm modulo physical equivalence. $\Rightarrow$ Continuity of RG maps.

5.4 Dobrushin’s reconstruction program

• Classify possible singularities of renormalized Gibbs states.

• Try to recover some basic thermodynamic results (e.g. the Variational Principle).

5.4.1 Generalized Gibbs states/“Classification” of singu- larities

Reminder: P is Gibbs if P is consistent with quasi-local specification Γ: (P ∈ G(Γ)):

$$\forall \Lambda : \ \sup_{\omega,\eta} \big|\gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda_n\setminus\Lambda}\eta_{\Lambda_n^c}) - \gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda_n\setminus\Lambda}\omega_{\Lambda_n^c})\big| \to 0 \quad \text{as } n \to \infty.$$

Specification Γ = {γΛ} might not be quasi-local:

$$\Omega_\Gamma := \Big\{\omega : \forall \Lambda,\ \sup_\eta \big|\gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda_n\setminus\Lambda}\eta_{\Lambda_n^c}) - \gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda^c})\big| \to 0 \ \text{as } n \to \infty\Big\}.$$

Definition: P is almost Gibbs if P is consistent with specification Γ: (P ∈ G(Γ)) and P(ΩΓ) = 1.

Gibbs specifications are given by

$$\gamma_\Lambda^\phi(\omega_\Lambda \mid \omega_{\Lambda^c}) = \frac{1}{z} \exp\big(-H_\Lambda^\phi(\omega_\Lambda\omega_{\Lambda^c})\big).$$

If $\phi$ is uniformly absolutely convergent, then $\Gamma^\phi = \{\gamma_\Lambda^\phi\}$ is quasi-local. The Hamiltonian
$$H_\Lambda^\phi(\omega) = \sum_{V \cap \Lambda \neq \emptyset} \phi(V, \omega_V)$$
might not converge everywhere.

$$\Omega_\phi = \{\omega : H_\Lambda^\phi(\omega) \text{ converges } \forall \Lambda\}.$$

Definition: P is called weakly Gibbs for φ if

$$P(\Omega_\phi) = 1 \quad \text{and} \quad P \in \mathcal G(\Gamma^\phi).$$

Almost Gibbs: quasi-locality a.e. Weakly Gibbs: convergence a.e. Definition: P is called intuitively weakly Gibbs for φ if P is weakly Gibbs for φ:

• $P(\Omega_\phi) = 1$, $H_\Lambda^\phi$ converges on $\Omega_\phi$, and $P \in \mathcal G(\Gamma^\phi)$; and

• there exists $\Omega_\phi^{\mathrm{reg}} \subseteq \Omega_\phi$ with $P(\Omega_\phi^{\mathrm{reg}}) = 1$ such that
$$\gamma_\Lambda^\phi(\omega_\Lambda \mid \omega_{\Lambda_n\setminus\Lambda}\eta_{\Lambda_n^c}) \to \gamma_\Lambda^\phi(\omega_\Lambda \mid \omega_{\Lambda^c})$$
for all $\omega, \eta \in \Omega_\phi^{\mathrm{reg}}$.

Theorem 5.6. $\mathcal G \subsetneq \mathcal{AG} \subsetneq \mathcal{IWG} \subseteq \mathcal{WG}$.

Open problem: is $\mathcal{IWG} \subsetneq \mathcal{WG}$?

Consider the convergence
$$\gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda_n\setminus\Lambda}\eta_{\Lambda_n^c}) \to \gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda^c}).$$

• Gibbs: for all $\omega$ and for all $\eta$.

• Almost Gibbs: for a.a. $\omega$ and for all $\eta$.

• Intuitively weakly Gibbs: for a.a. $\omega$ and for a.a. $\eta$.

Definition: Let $\Omega = A^{\mathbb Z^d}$. A function $f : \Omega \to \mathbb R$ is called quasi-local at $\omega \in \Omega$ in the direction $\theta \in \Omega$ if $f(\omega) = \lim_{\Lambda \nearrow \mathbb Z^d} f(\omega_\Lambda\theta_{\Lambda^c})$. Remark: $f$ is quasi-local at $\omega$ $\iff$ $f$ is quasi-local at $\omega$ in all directions $\theta \in \Omega$.

A specification $\Gamma = \{\gamma_\Lambda(\cdot\mid\cdot)\}$ is called quasi-local at $\omega \in \Omega$ in the direction $\theta \in \Omega$ if $\forall \Lambda$, $\gamma_\Lambda(\omega_\Lambda \mid \omega_{\Lambda^c}) = f_\Lambda(\omega)$ is quasi-local at $\omega$ in the direction $\theta$ ($\iff$ $\gamma_\Lambda g$ is quasi-local at $\omega$ in the direction $\theta$, for all $\Lambda$ and all quasi-local $g$).

Consider the following sets:

$$\Omega_{\mathrm{qloc}}^\theta(\Gamma) = \Omega_q^\theta(\Gamma) = \{\omega \in \Omega : \Gamma \text{ is quasi-local at } \omega \text{ in the direction } \theta\},$$

$$\Omega_{\mathrm{qloc}}(\Gamma) = \Omega_q(\Gamma) = \{\omega \in \Omega : \Gamma \text{ is quasi-local at } \omega\}.$$

Proposition 5.7.

• If $P$ is consistent with a specification $\Gamma$ and $\exists \theta \in \Omega$ such that $P(\Omega_q^\theta(\Gamma)) = 1$, then $P$ is weakly Gibbs.

• If $P$ is intuitively weakly Gibbs, then it is consistent with a specification $\Gamma$ such that
$$P\big(\{\theta \in \Omega : P(\Omega_q^\theta(\Gamma)) = 1\}\big) = 1.$$

• If $P$ is almost Gibbs, then it is consistent with a specification $\Gamma$ such that
$$P(\Omega_q(\Gamma)) = 1.$$

Möbius Transform: Let $E$ be a finite set, and consider two collections

$$F = \{F_A\}_{A\subseteq E}, \qquad G = \{G_A\}_{A\subseteq E}.$$
Then the following are equivalent:

• $\forall A : G_A = \sum_{B\subseteq A} F_B$;

• $\forall A : F_A = \sum_{B\subseteq A} (-1)^{|A\setminus B|} G_B$.

Gibbs specifications:
$$\gamma_\Lambda(\omega_\Lambda \mid \sigma_{\Lambda^c}) = \frac{1}{Z(\sigma_{\Lambda^c})} \exp\big(-H_\Lambda(\omega_\Lambda\sigma_{\Lambda^c})\big) \ \Longrightarrow\ \frac{\gamma_\Lambda(\omega_\Lambda \mid \theta_{\Lambda^c})}{\gamma_\Lambda(\theta_\Lambda \mid \theta_{\Lambda^c})} = \exp\big(-[H_\Lambda(\omega_\Lambda\theta_{\Lambda^c}) - H_\Lambda(\theta_\Lambda\theta_{\Lambda^c})]\big).$$

Fix $\theta \in \Omega$, and look for a potential $\tilde\phi$ with
$$\sum_{V\cap\Lambda\neq\emptyset} \tilde\phi_V(\omega_\Lambda\theta_{\Lambda^c}) = \tilde H_\Lambda(\omega_\Lambda\theta_{\Lambda^c}) = -\log\frac{\gamma_\Lambda(\omega_\Lambda \mid \theta_{\Lambda^c})}{\gamma_\Lambda(\theta_\Lambda \mid \theta_{\Lambda^c})}.$$

Vacuum Potential:
$$\phi_\Gamma^\theta(A, \omega) = -\sum_{\Lambda\subseteq A} (-1)^{|A\setminus\Lambda|} \log\frac{\gamma_\Lambda(\omega_\Lambda \mid \theta_{\Lambda^c})}{\gamma_\Lambda(\theta_\Lambda \mid \theta_{\Lambda^c})} \iff \log\frac{\gamma_\Lambda(\omega_\Lambda \mid \theta_{\Lambda^c})}{\gamma_\Lambda(\theta_\Lambda \mid \theta_{\Lambda^c})} = -\sum_{A\subseteq\Lambda} \phi_\Gamma^\theta(A, \omega).$$
Moreover,
$$H_\Lambda^\theta(\omega) = \sum_{A\cap\Lambda\neq\emptyset} \phi_\Gamma^\theta(A, \omega) = \sum_{A\subseteq\Lambda} \phi_\Gamma^\theta(A, \omega) + \sum_{A\cap\Lambda\neq\emptyset,\ A\cap\Lambda^c\neq\emptyset} \phi_\Gamma^\theta(A, \omega), \qquad H_\Lambda^\theta(\omega_\Lambda\theta_{\Lambda^c}) = \sum_{A\subseteq\Lambda} \phi_\Gamma^\theta(A, \omega_A) + 0.$$
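The Möbius inversion used here is a purely combinatorial identity, and can be verified directly on a small set. In the sketch below the collection $F$ is an arbitrary deterministic choice for illustration (not from the text):

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of the iterable s, as frozensets."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

E = frozenset({0, 1, 2})
# an arbitrary deterministic collection F = {F_A} indexed by subsets of E
F = {A: 1 + sum(2 ** i for i in A) for A in subsets(E)}

# forward Moebius transform: G_A = sum_{B subset A} F_B
G = {A: sum(F[B] for B in subsets(A)) for A in subsets(E)}

# inverse transform: F_A = sum_{B subset A} (-1)^{|A \ B|} G_B
F_rec = {A: sum((-1) ** len(A - B) * G[B] for B in subsets(A))
         for A in subsets(E)}

assert F_rec == F
```

The vacuum potential $\phi_\Gamma^\theta$ is exactly this inversion applied to $A \mapsto -\log\big(\gamma_A(\omega_A\mid\theta_{A^c})/\gamma_A(\theta_A\mid\theta_{A^c})\big)$.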

Definition: $\phi = \{\phi(\Lambda, \cdot)\}$ has vacuum $\theta \in \Omega$ if $\phi(A, \omega_A) = 0$ whenever $\omega_i = \theta_i$ for some $i \in A$. Therefore, in the sum above,

$$\sum_{A\cap\Lambda\neq\emptyset,\ A\cap\Lambda^c\neq\emptyset} \phi\big(A, (\omega_\Lambda\theta_{\Lambda^c})_A\big) = 0.$$

Exercise: Show that $\phi_\Gamma^\theta(A, \omega) := -\sum_{\Lambda\subseteq A} (-1)^{|A\setminus\Lambda|} \log\dfrac{\gamma_\Lambda(\omega_\Lambda\mid\theta_{\Lambda^c})}{\gamma_\Lambda(\theta_\Lambda\mid\theta_{\Lambda^c})}$ has vacuum $\theta$.

$$\phi_\Gamma^\theta(A, \omega) = -\sum_{\Lambda\subseteq A} (-1)^{|A\setminus\Lambda|} \log\frac{\gamma_\Lambda(\omega_\Lambda \mid \theta_{\Lambda^c})}{\gamma_\Lambda(\theta_\Lambda \mid \theta_{\Lambda^c})}$$
is the desired vacuum potential. But

• Potential might not be convergent.

• Even if it is convergent, it might not be consistent with Γ.

Proposition 5.8. The vacuum potential $\phi_\Gamma^\theta$ is convergent at $\omega$ and consistent with $\Gamma$ at $\omega$ if and only if $\omega \in \Omega_\Gamma^\theta$, where

$$\sum_{A\subseteq L,\ A\cap\Lambda\neq\emptyset} \phi_\Gamma^\theta(A, \omega) = -\log\frac{\gamma_\Lambda(\omega_\Lambda \mid \omega_{L\setminus\Lambda}\theta_{L^c})}{\gamma_\Lambda(\theta_\Lambda \mid \omega_{L\setminus\Lambda}\theta_{L^c})}.$$

5.4.2 Variational Principles

Definition: Suppose $P, Q$ are translation-invariant measures on $\Omega = A^{\mathbb Z^d}$. The relative entropy density is

$$h(Q \mid P) = \lim_{n\to\infty} \frac{1}{|\Lambda_n|} \sum_{\omega\in A^{\Lambda_n}} Q(\omega) \log\frac{Q(\omega)}{P(\omega)},$$
provided the limit exists.
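For two Bernoulli product measures the limit exists trivially and equals the single-site Kullback-Leibler divergence, since the sum factorizes over sites. A small one-dimensional sketch (the parameters $p, q$ are illustrative assumptions):

```python
import itertools
import math

p, q = 0.3, 0.45  # single-site marginals of two Bernoulli product measures

def kl(a, b):
    """Kullback-Leibler divergence of Bernoulli(a) from Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

n = 5
total = 0.0
for w in itertools.product([0, 1], repeat=n):
    Qw = math.prod(q if s else 1 - q for s in w)
    Pw = math.prod(p if s else 1 - p for s in w)
    total += Qw * math.log(Qw / Pw)

per_site = total / n          # finite-volume relative entropy per site
assert abs(per_site - kl(q, p)) < 1e-12
```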

Remark: If $P$ is Gibbs for a UAC potential $\phi$, the following hold:

Theorem 5.9. $h(Q \mid P)$ exists for any $Q \in \mathcal M_S(\Omega)$.

Theorem 5.10 (Variational Principle). If $P \in \mathcal G(\phi)$ is a translation-invariant measure and $Q$ is a translation-invariant measure, then $h(Q \mid P) = 0 \iff Q \in \mathcal G(\phi)$.

$$h(Q \mid P) = \lim_{\Lambda_n\nearrow\mathbb Z^d} \frac{1}{|\Lambda_n|} \sum_{\omega_{\Lambda_n}} Q(\omega_{\Lambda_n}) \log\frac{Q(\omega_{\Lambda_n})}{P(\omega_{\Lambda_n})} = -h(Q) - \lim_{\Lambda_n\nearrow\mathbb Z^d} \frac{1}{|\Lambda_n|} \sum_{\omega_{\Lambda_n}} Q(\omega_{\Lambda_n}) \log P(\omega_{\Lambda_n}).$$

For Gibbs $P$:
$$h(Q \mid P) = P(f_\phi) - h(Q) - \int f_\phi\, dQ \ge 0.$$

Variational Principles for Generalized Gibbs states

Theorem 5.11. Suppose h(Q|P) = 0. Then

Q(σo|σoc ) = P(σo|σoc ) (Q-a.s.)

But what does this mean for general $Q$ and $P$, which may be mutually singular? Indeed, $Q(\sigma_o \mid \sigma_{o^c})$ is defined only $Q$-a.s., while $P(\sigma_o \mid \sigma_{o^c})$ is defined only $P$-a.s.

One can conclude very little if, for example, P is not regular in some sense and Q “sits” on a set of P-regular points.

Side remark: ∃P: h(Q|P) = 0 ∀Q.

In the category of weakly Gibbsian measures one cannot do anything "sensible":

Theorem 5.12. There exist weakly Gibbsian measures $P_1, P_2$ with

$$h(P_1 \mid P_2) = h(P_2 \mid P_1) = 0,$$
but $P_1 \notin \mathcal G(\Gamma_2)$ and $P_2 \notin \mathcal G(\Gamma_1)$, where $\Gamma_i$ is the weakly Gibbsian specification for $P_i$, $i = 1, 2$.

In the category of almost Gibbsian measures:

Theorem 5.13. Suppose $P$ is almost Gibbs for $\Gamma$ ($P(\Omega_\Gamma) = 1$) and $Q$ is concentrated on $\Omega_\Gamma$ ($Q(\Omega_\Gamma) = 1$). Then $h(Q \mid P) = 0 \Rightarrow Q \in \mathcal G(\Gamma)$, and hence $Q$ is almost Gibbs as well.

Theorem 5.14. If $P \in \mathcal G(\Gamma)$, then

$$\tilde\Omega_\Gamma = \big\{\omega : P(\sigma_\Lambda \mid \omega_{\Lambda_n\setminus\Lambda}) \to \gamma_\Lambda(\sigma_\Lambda \mid \omega_{\Lambda^c})\big\}$$

has P-measure 1. If

h(Q|P) = 0 and Q(Ω˜ Γ) = 1,

then Q ∈ G(Γ).

This implies the variational principle for intuitively weakly Gibbsian states.

Theorem 5.15. Suppose $P \in \mathcal G(\phi)$ on $\Omega = A^{\mathbb Z^d}$, and $Q = \varphi_* P = P \circ \varphi^{-1}$, where $\varphi : A^{\mathbb Z^d} \to B^{\mathbb Z^d}$ is a single-site renormalization transformation. Then $Q$ might be non-Gibbs, but

• $h(\tilde Q \mid Q)$ exists for any $\tilde Q \in \mathcal M_s(B^{\mathbb Z^d})$;

• if $h(\tilde Q \mid Q) = 0$, then there exists $\tilde P \in \mathcal G(\phi)$ with $\tilde Q = \varphi_*\tilde P = \tilde P \circ \varphi^{-1}$.

This is a complete recovery of the Variational Principle for Gibbs states in the case of Hidden Gibbs states.

The key is the following result : if Q = ϕ∗P,

h(Q˜ |Q) = inf h(P˜ |P), ϕ∗P˜ =Q˜

and the infimum is achieved.

h(Q˜ |Q) = 0 ⇒ h(P˜ |P) = 0 ⇒ P˜ is Gibbs.

Theorem 5.16 (First Fundamental Theorem). If $P_1, P_2 \in \mathcal G(\phi)$ on $A^{\mathbb Z^d}$ and $\varphi : A^{\mathbb Z^d} \to B^{\mathbb Z^d}$ is single-site, then $Q_1 = \varphi_* P_1$, $Q_2 = \varphi_* P_2$ are either both Gibbs for the same potential $\psi$, or both non-Gibbs.

0 ≤ h(Q1|Q2) ≤ h(P1|P2) = 0 ⇒ h(Q1|Q2) = h(Q2|Q1) = 0.

If, say, $Q_2$ is Gibbs, then $h(Q_1 \mid Q_2) = 0$ implies that $Q_1$ is Gibbs for the same potential.

Corollary 5.17 (Corollary of the First Fundamental Theorem). Suppose $\varphi_* : \mathcal G_{A^{\mathbb Z^d}}(\phi) \to \mathcal G_{B^{\mathbb Z^d}}(\psi)$, $P \mapsto Q$; then $\varphi_*\big(\mathcal G_{A^{\mathbb Z^d}}(\phi)\big) \subseteq \mathcal G_{B^{\mathbb Z^d}}(\psi)$.

The Variational Principle tells us that

$$\varphi_*\big(\mathcal G_{A^{\mathbb Z^d}}(\phi)\big) = \mathcal G_{B^{\mathbb Z^d}}(\psi) \quad \text{(onto)}.$$

"Conjecture": Modulo trivial counterexamples,

$$\varphi_* : \mathcal G_{A^{\mathbb Z^d}}(\phi) \to \mathcal G_{B^{\mathbb Z^d}}(\psi)$$
is a bijection.

RG relation :

$$\mathcal R_\varphi = \Big\{(\phi; \psi) : \phi \text{ is a UAC interaction on } A^{\mathbb Z^d},\ \psi \text{ is a UAC interaction on } B^{\mathbb Z^d},\ \exists P \in \mathcal G(\phi),\ Q \in \mathcal G(\psi) : Q = \varphi_* P\Big\}$$

It is a relation by First Fundamental Theorem.

Rϕ(ψ) = {φ : (φ, ψ) ∈ Rϕ} fibre over ψ.

Rϕ(ψ) is typically non-trivial : ϕ is many-to-one.

Theorem 5.18. If $\psi_1, \psi_2$ are uniformly absolutely convergent interactions on $B^{\mathbb Z^d}$, then $\mathcal R_\varphi(\psi_1)$ and $\mathcal R_\varphi(\psi_2)$ are isomorphic: if $(\phi_1, \psi_1) \in \mathcal R_\varphi$, then $\phi_2 = \phi_1 - \psi_1 \circ \varphi + \psi_2 \circ \varphi$ satisfies $(\phi_2, \psi_2) \in \mathcal R_\varphi$.

Theorem 5.19 (Second Fundamental Theorem). If $\varphi(\phi_1) = \psi_1$ and $\varphi(\phi_2) = \psi_2$, then $|||\psi_1 - \psi_2||| \le K \cdot |||\phi_1 - \phi_2|||$.

Claim: $K = 1$. What is $|||\cdot|||$? It is the supremum norm on $f_\phi$, modulo physical equivalence. Suppose $\phi = \{\phi(\Lambda,\cdot)\}_{\Lambda\Subset\mathbb Z^d}$ is a UAC interaction, and $\tilde\phi = \{\phi(\Lambda,\cdot) + c_\Lambda\}_{\Lambda\Subset\mathbb Z^d}$ is also UAC; then $\mathcal G(\phi) = \mathcal G(\tilde\phi)$ (since $\gamma_\Lambda^\phi = \gamma_\Lambda^{\tilde\phi}$ $\forall\Lambda$).

$$|||\phi_1 - \phi_2||| = \inf_{g \text{ coboundary},\ c \text{ constant}} \|f_{\phi_1} - f_{\phi_2} + g - c\|.$$

Definition: For $f : A^{\mathbb Z^d} \to \mathbb R$, $\Lambda_n := [0, n-1]^d \cap \mathbb Z^d$, and the shift action $T$ on $A^{\mathbb Z^d}$, the pressure is
$$P(f) := \lim_{n\to\infty} \frac{1}{|\Lambda_n|} \log \sum_{\omega_{\Lambda_n}\in A^{\Lambda_n}} \exp\Big(\sup_{\omega' :\ \omega'_{\Lambda_n} = \omega_{\Lambda_n}} \sum_{k\in\Lambda_n} f(T^k\omega')\Big).$$

Theorem 5.20 (Weak Variational Principle). For $f_\Phi \in C(A^{\mathbb Z^d})$,
$$P(f_\Phi) = \sup\Big\{h(P) + \int f_\Phi\, dP\Big\}.$$

$P$ is an equilibrium state if $P(f_\Phi) = h(P) + \int f_\Phi\, dP$.

(1) The relative pressure is
$$P(f \mid P) = \lim_{n\to\infty} \frac{1}{|\Lambda_n|} \log \int \exp\Big(\sum_{k\in\Lambda_n} f(T^k\omega)\Big)\, P(d\omega),$$
provided the limit exists; here $P$ is Gibbs, $P \in \mathcal G(\Phi)$. Indeed,
$$\begin{aligned} \log \sum_{\omega_{\Lambda_n}\in A^{\Lambda_n}} \exp\Big(\sum_{k\in\Lambda_n} f(T^k\omega)\Big)\, P(\omega_{\Lambda_n}) &= \log \sum_{\omega_{\Lambda_n}\in A^{\Lambda_n}} \exp\Big(\sum_{k\in\Lambda_n} f(T^k\omega)\Big) \exp\Big(\sum_{k\in\Lambda_n} f_\Phi(T^k\omega) - |\Lambda_n| P(f_\Phi)\Big) \\ &= \log \sum_{\omega_{\Lambda_n}\in A^{\Lambda_n}} \exp\Big(\sum_{k\in\Lambda_n} f(T^k\omega) + \sum_{k\in\Lambda_n} f_\Phi(T^k\omega) - |\Lambda_n| P(f_\Phi)\Big), \end{aligned}$$
so that
$$P(f \mid P) = P(f + f_\Phi) - P(f_\Phi).$$

(2)
$$h(Q \mid P) = P(f_\Phi) - h(Q) - \int f_\Phi\, dQ.$$

Theorem 5.21.
$$P(f \mid P) = \sup_Q \Big\{\int f\, dQ - h(Q \mid P)\Big\}, \qquad h(Q \mid P) = \sup_{f\in C(A^{\mathbb Z^d})} \Big\{\int f\, dQ - P(f \mid P)\Big\}.$$

For $\eta \in B^{\mathbb Z^d}$, let $X_\eta = \varphi^{-1}(\eta) = \{\omega : \varphi(\omega) = \eta\} = \prod_{k\in\mathbb Z^d} \varphi^{-1}(\eta_k)$, and define the relative pressure
$$P(f, \varphi)(\eta) = P(f \mid X_\eta) := \lim_{n\to\infty} \frac{1}{|\Lambda_n|} \log \sum_{\omega_{\Lambda_n} :\ \varphi(\omega_{\Lambda_n}) = \eta_{\Lambda_n}} \exp\Big(\sup \sum_{k\in\Lambda_n} f(T^k\omega)\Big).$$

For fixed $\eta$, the map $P(\cdot, \varphi)(\eta) : C(A^{\mathbb Z^d}) \to \mathbb R$ is convex, and

$$|P(f_1, \varphi)(\eta) - P(f_2, \varphi)(\eta)| \le \|f_1 - f_2\|_\infty, \qquad P(f, \varphi)(\eta) = P(f, \varphi)(S^k\eta).$$

Theorem 5.22. Let $\varphi : A^{\mathbb Z^d} \to B^{\mathbb Z^d}$, $f \in C(A^{\mathbb Z^d})$, $Q \in \mathcal M_S(B^{\mathbb Z^d})$. Then
$$h(Q) + \int_{B^{\mathbb Z^d}} P(f, \varphi)(y)\, Q(dy) = \sup_{Q = \varphi_* P} \Big\{h(P) + \int_{A^{\mathbb Z^d}} f\, dP\Big\}.$$

Theorem 5.23.
$$P \in ES_{A^{\mathbb Z^d}}(f) \iff Q = \varphi_* P \in ES_{B^{\mathbb Z^d}}\big(P(f, \pi)\big),$$
and $P$ is a relative equilibrium state for $f$ over $Q$.

Suppose $Q = \varphi_* P$, $P \in \mathcal G(\Phi) = ES(f_\Phi)$:
$$h(Q) + \int P(f_\Phi, \varphi)\, dQ = \sup_{\tilde Q} \Big\{h(\tilde Q) + \int P(f_\Phi, \varphi)\, d\tilde Q\Big\}.$$

Suppose $Q \in \mathcal G(\Psi) = ES(\tilde f_\Psi)$:
$$h(Q) + \int \tilde f_\Psi\, dQ = \sup_{\tilde Q} \Big\{h(\tilde Q) + \int \tilde f_\Psi\, d\tilde Q\Big\}.$$

Here
$$f_\Psi^*(\omega) = \lim_{n\to\infty} \frac{1}{|\Lambda_n|} \sum_{k\in\Lambda_n} f_\Psi(S^k\omega), \qquad \lim_{n\to\infty} \frac{1}{|\Lambda_n|} \sum_{k\in\Lambda_n} f_\Psi(S^k\omega) = P(f_\phi, \pi), \quad Q\text{-a.s. for every ergodic } Q.$$

5.5 Criteria for Preservation of Gibbs property under renormalization

Suppose $P \in \mathcal G(\phi)$ and $Q = \varphi_* P \in \mathcal G(\psi)$, where $\varphi : A^{\mathbb Z^d} \to B^{\mathbb Z^d}$ is a single-site transformation.

$$Q = \varphi_* P = P \circ \varphi^{-1} \iff Q(y_\Lambda) = \sum_{x_\Lambda\in\varphi^{-1}(y_\Lambda)} P(x_\Lambda); \quad \text{in particular,} \quad Q(y_{\Lambda_n}) = \sum_{x_{\Lambda_n}\in\varphi^{-1}(y_{\Lambda_n})} P(x_{\Lambda_n}).$$

• If $P$ is Gibbs, then
$$P(x_{\Lambda_n}) = \exp\big((S_{\Lambda_n} f_\phi)(x) - |\Lambda_n| P(f_\phi)\big) \cdot D_{\Lambda_n}(x).$$

• If $Q$ is Gibbs, then
$$Q(y_{\Lambda_n}) = \exp\big((S_{\Lambda_n} f_\psi)(y) - |\Lambda_n| P(f_\psi)\big) \cdot \tilde D_{\Lambda_n}(y),$$
where
$$(S_\Lambda g)(z) = \sum_{k\in\Lambda} g(S^k z)$$
and

$$\log D_{\Lambda_n}(x),\ \log \tilde D_{\Lambda_n}(y) = o(|\Lambda_n|).$$

Therefore,
$$\log Q(y_{\Lambda_n}) = \sum_{k\in\Lambda_n} f_\psi(S^k y) - |\Lambda_n| P(f_\psi) + o(|\Lambda_n|) = \log \sum_{x_{\Lambda_n}\in\varphi^{-1}(y_{\Lambda_n})} \exp\Big(\sum_{k\in\Lambda_n} f_\phi(T^k x)\Big) - |\Lambda_n| P(f_\phi) + o(|\Lambda_n|).$$

Dividing by $|\Lambda_n|$ and taking $\limsup$/$\lim$:
$$\frac{1}{|\Lambda_n|} \log Q(y_{\Lambda_n}) = \frac{1}{|\Lambda_n|} \sum_{k\in\Lambda_n} f_\psi(S^k y) - P(f_\psi) + \frac{o(|\Lambda_n|)}{|\Lambda_n|} \to f_\psi^*(y) - P(f_\psi),$$
while the right-hand side converges to $P(f_\phi, \pi)(y) - P(f_\phi)$. Thus,
$$f_\psi^*(y) - P(f_\psi) = P(f_\phi, \pi)(y) - P(f_\phi).$$

Lemma 5.24. If $P \in \mathcal G(\phi)$ and $Q = \varphi_* P \in \mathcal G(\psi)$, then

$$f_\psi^*(y) = P(f_\phi, \pi)(y) + \mathrm{const},$$
where
$$f_\psi^*(y) = \limsup_{n\to\infty} \frac{1}{|\Lambda_n|} \sum_{k\in\Lambda_n} f_\psi(S^k y),$$

and $P(f_\phi, \pi)(y)$ is the relative pressure of $f_\phi$ on the fibre $X_y$,

$$X_y = \{x \in A^{\mathbb Z^d} : \varphi(x) = y\} = \varphi^{-1}(y).$$

Problem/Question: Given $P(f_\phi, \pi)(y)$, a translation-invariant function on $B^{\mathbb Z^d}$, decide whether there exists a "smooth" $f_\psi : B^{\mathbb Z^d} \to \mathbb R$ such that $f_\psi^*(y) = P(f_\phi, \pi)(y)$.

Conditional measures: for $c \searrow \{y\}$,
$$\int f(x)\, \mu^c(dx) = \frac{1}{\nu(c)} \int_{\pi^{-1}(c)} f(x)\, \mu(dx).$$

Let
$$P_n^y := P(\cdot \mid y_0^n) = P(\cdot \mid \pi^{-1}(y_0^n)).$$
A point $y \in Y$ is a Tjur point if $P_n^y \to P^y$ as $n \to \infty$.

Theorem 5.25 (Tjur). Let $Y_0 \subseteq Y$ be the set of Tjur points. Then the map $Y_0 \to \mathcal M(X)$, $y \mapsto P^y$, is continuous on $Y_0$.

Theorem 5.26. Let $Q = P \circ \pi^{-1}$. If $Y_0 = Y$, then $Q$ is Gibbs.

Take $y \in Y$ and the family $\{P_n^y\}_{n=1}^\infty$ of measures on $X_y$. Let
$$m_y := \{\text{all limit points of } \{P_n^y\}_{n=1}^\infty\} \subseteq \mathcal G_{X_y}(\phi) = \mathcal G_\phi(X_y).$$

• $|\mathcal G_{X_y}(\phi)| = 1$ for every $y$ if and only if $Q$ is Gibbs.

• If $|m_y| = 1$ for every $y$, then $Q$ is Gibbs. The converse is an open problem.

Part II

Applications


Chapter 6

Hidden Markov Models

Hidden Markov Models, briefly discussed in Chapter 3, are extremely popular models with a broad range of applications, from biological sequence analysis to speech recognition. The theory, as well as possible applications, is very well covered in the literature [20-24]. In this chapter we briefly review basic facts of the theory of Hidden Markov Models and discuss thermodynamic aspects of some of this theory.

6.1 Three basic problems for HMM’s

As defined earlier, a Hidden Markov Process is a random function of a Markov process. More specifically, suppose $\{X_n\}$ is a Markov chain with values in $A$, transition probability matrix $P$ and initial distribution $\pi$. Then $\{Y_n\}$ is constructed as follows: independently for every $n$, $Y_n$ is chosen according to an emission distribution which depends only on the value of the underlying Markov chain $X_n$:

$$P(Y_n = j \mid X_n = i) = \Pi_{ij} \qquad \forall i \in A,\ j \in B.$$

The key property of Hidden Markov chains is that $Y_n$ depends only on the value of $X_n$.

The Markov processes are characterized by the Markov property: the future (Xn+1, Xn+2,...) and the past (..., Xn−2, Xn−1) are conditionally independent, given the present Xn:

..., Xn−3, Xn−2, Xn−1 ⊥Xn Xn+1, Xn+2,...

The Hidden Markov processes inherit this property: the future (Yn+1, Yn+2,...) and the past (..., Yn−2, Yn−1) are conditionally independent, given the present Xn:

..., Yn−3, Yn−2, Yn−1 ⊥Xn Yn+1, Yn+2, while

..., Yn−3, Yn−2, Yn−1 6 ⊥Yn Yn+1, Yn+2,

unless {Yn} is Markov. The Hidden Markov Process is fully determined by λ = (π, P, Π). As in the case of functions of Markov chains, we will denote by Q = Qλ the law of {Yn} with parameters λ. We will continue to use P to denote the joint distribution of {Xn} and {Yn}. There are three important questions one should be able to answer:

(1) The Evaluation Problem: Given the parameters λ = (π, P, Π) of the model, what is the efficient method to evaluate Q [Y0:n = y0:n]? Note that

$$Q[Y_{0:n} = y_{0:n}] = \sum_{x_{0:n}\in A^{n+1}} P[X_{0:n} = x_{0:n}]\, P[Y_{0:n} = y_{0:n} \mid X_{0:n} = x_{0:n}] = \sum_{x_{0:n}\in A^{n+1}} \pi_{x_0} \prod_{i=0}^{n-1} P_{x_i, x_{i+1}} \prod_{i=0}^{n} \Pi_{x_i, y_i},$$

and the sum involves exponentially many terms.

(2) The Decoding or State Estimation Problem: Given the observations $Y_{0:n} = y_{0:n}$ and assuming that the parameters $(\pi, P, \Pi)$ are known, find the most probable "input" realization $X_{0:n}$.

(3) The Learning Problem: Given the observations Y0:n = y0:n, estimate the parameters λ = (π, P, Π). For example, one could be interested in the maximum likelihood estimate

$$\hat\lambda = \operatorname*{argmax}_\lambda\, Q_\lambda(Y_{0:n} = y_{0:n}).$$

6.1.1 The Evaluation Problem

It turns out that one does not need to evaluate exponentially many terms n in order to "score" the sequence Y0 . For 0 ≤ m ≤ n, let us introduce the forward variables

αm(i) = P [Y0 = y0,..., Ym = ym, Xm = i] , i ∈ A, and the backward variables

βm(i) = P [Ym+1 = ym+1,..., Yn = yn|Xm = i] , i ∈ A.

 Exercise 6.1. Show that the forward and backward variables satisfy the following recurrence relations:

$$\alpha_0(i) = \pi(i)\,\Pi_{i y_0}, \qquad \alpha_{m+1}(i) = \Big(\sum_{i'\in A} \alpha_m(i')\, p_{i'i}\Big)\, \Pi_{i y_{m+1}}, \quad m = 0, \dots, n-1,$$

and

$$\beta_n(i) = 1, \qquad \beta_m(i) = \sum_{i'\in A} \beta_{m+1}(i')\, p_{ii'}\, \Pi_{i' y_{m+1}}, \quad m = n-1, \dots, 0.$$

Proposition 6.2. For every 0 ≤ m ≤ n, one has

Q [Y0 = y0,..., Yn = yn] = ∑ αm(i)βm(i). i∈A

Proof. One has

$$\begin{aligned} Q[Y_{0:n} = y_{0:n}] &= \sum_i P[Y_{0:n} = y_{0:n}, X_m = i] = \sum_i P[Y_{0:n} = y_{0:n} \mid X_m = i]\, P[X_m = i] \\ &= \sum_i P[Y_{0:m} = y_{0:m}, Y_{m+1:n} = y_{m+1:n} \mid X_m = i]\, P[X_m = i] \\ &= \sum_i P[Y_{0:m} = y_{0:m} \mid X_m = i]\, P[Y_{m+1:n} = y_{m+1:n} \mid X_m = i]\, P[X_m = i] \\ &= \sum_i P[Y_{0:m} = y_{0:m}, X_m = i]\, P[Y_{m+1:n} = y_{m+1:n} \mid X_m = i] = \sum_i \alpha_m(i)\,\beta_m(i). \end{aligned}$$
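The recurrences of Exercise 6.1 and Proposition 6.2 can be sketched directly in code. The HMM parameters below are hypothetical, chosen only for illustration; the check confirms that $\sum_i \alpha_m(i)\beta_m(i)$ is the same for every $m$ and equals the brute-force sum over all hidden paths:

```python
import itertools
import numpy as np

# hypothetical small HMM: 2 hidden states, 3 output symbols
P  = np.array([[0.7, 0.3], [0.4, 0.6]])            # transition matrix
pi = np.array([0.5, 0.5])                          # initial distribution
Pi = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])  # emission matrix

y = [0, 2, 1, 1, 0]
n = len(y) - 1

# forward variables: alpha_m(i) = P[Y_0..Y_m = y_0..y_m, X_m = i]
alpha = [pi * Pi[:, y[0]]]
for m in range(n):
    alpha.append((alpha[-1] @ P) * Pi[:, y[m + 1]])

# backward variables: beta_m(i) = P[Y_{m+1}..Y_n | X_m = i]
beta = [np.ones(2)]
for m in range(n - 1, -1, -1):
    beta.insert(0, P @ (Pi[:, y[m + 1]] * beta[0]))

# sum_i alpha_m(i) beta_m(i) is the same for every m ...
vals = [float(a @ b) for a, b in zip(alpha, beta)]

# ... and equals the brute-force sum over all hidden paths
brute = sum(
    pi[x[0]] * np.prod([P[x[i], x[i + 1]] for i in range(n)])
             * np.prod([Pi[x[i], y[i]] for i in range(n + 1)])
    for x in itertools.product(range(2), repeat=n + 1))
assert all(abs(v - brute) < 1e-12 for v in vals)
```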

6.1.2 The Decoding or State Estimation Problem

n n+1 Suppose we observed y0 ∈ B as the realization of the Hidden Markov n chain Y0 , and now we would like to estimate the most "probable" sequence of states that lead to this output. Let us introduce the following probability distributions on A:

n n Pt = P(Xt = ·|Y0 = y0 ), t = 0, . . . , n.

For t < n, Pt is called a smoothing probability; for t = n, Pt is called the filtering probability. The definition also makes sense for t > n, then one speaks of the prediction probabilities.

If we know or can estimate $P_t$, we can estimate the state $X_t$ as
$$\hat X_t = \operatorname*{argmax}_{i\in A} P_t(i) = \operatorname*{argmax}_{i\in A} P(X_t = i \mid Y_0^n = y_0^n).$$

That is an optimal choice from the Bayesian point of view, as it leads to the minimal expected loss

$$\frac{1}{n} \sum_{t=0}^{n} \mathbb{I}[\hat X_t \neq X_t].$$

On the other hand, it is very well possible that, as a realization of a Markov chain, $\{\hat X_t : t = 0, \dots, n\}$ has very low probability (possibly, even probability 0). That is why one sometimes applies the maximum a posteriori (MAP) decision rule:

ˆ n n n n n X0 = argmax P[X0 = x0 | Y0 = y0 ] (6.1.1) n n+1 x0 ∈A

The famous Viterbi’s algorithm provides an efficient method to find max- imizers in (6.1.1). However, Viterbi’s algorithm does not necessarily min- imize the number of “symbol” errors. We will concentrate on studying properties of smoothing probabilities.
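Viterbi's algorithm itself is not spelled out here; a minimal dynamic-programming sketch of a maximizer in (6.1.1), with hypothetical parameters and a brute-force cross-check, is:

```python
import itertools
import numpy as np

# hypothetical HMM parameters (pi, P, Pi as in the text)
P  = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
Pi = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

def viterbi(y):
    """A most probable hidden path, i.e. a maximizer in (6.1.1), via dynamic programming."""
    delta = np.log(pi) + np.log(Pi[:, y[0]])  # best log-prob of a path ending in each state
    back = []
    for t in range(1, len(y)):
        scores = delta[:, None] + np.log(P)   # scores[i, j]: best path ... -> i -> j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(Pi[:, y[t]])
    path = [int(delta.argmax())]
    for bp in reversed(back):                 # backtrack
        path.insert(0, int(bp[path[0]]))
    return path

def joint(x, y):
    """P[X_0^n = x, Y_0^n = y]."""
    p = pi[x[0]] * Pi[x[0], y[0]]
    for t in range(1, len(y)):
        p *= P[x[t - 1], x[t]] * Pi[x[t], y[t]]
    return p

y = [0, 2, 1, 1, 0]
best = max(itertools.product(range(2), repeat=len(y)), key=lambda x: joint(x, y))
assert list(best) == viterbi(y)
```

As the text notes, this path maximizes the joint probability of the whole sequence; it does not minimize the expected number of symbol errors.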

6.2 Gibbs property of Hidden Markov Chains

In Chapter ??, we stated without proof that functions of Markov chains with strictly positive transition probability matrices are regular. In this section, we give a proof of this theorem in the case of Hidden Markov chains.

Theorem 6.3. Suppose $\{Y_n\}$ is a Hidden Markov Chain for a stationary Markov chain $\{X_n\}$ with $P > 0$ and emission matrix $\Pi > 0$. Then $\{Y_n\}$ is regular, i.e., $Q$ is a $g$-measure, as well as Gibbs.

Exercise 6.4. Under the conditions of Theorem 6.3, show that there exist $\alpha$ and $\beta$ such that
$$0 < \alpha \le Q[Y_0 = y_0 \mid Y_1^n = y_1^n] \le \beta < 1$$
for all $n \in \mathbb N$ and $y \in B^{\mathbb Z_+}$.

Proof. For any $y = (y_0, y_1, \dots) \in B^{\mathbb Z_+}$ let
$$g_n(y) = g_n(y_0^n) = Q[Y_0 = y_0 \mid Y_1^n = y_1^n], \qquad g_n(y, i) = P[Y_0 = y_0 \mid Y_1^n = y_1^n, X_{n+1} = i],$$
$$\bar g_n(y) = \max_i g_n(y, i), \qquad \underline g_n(y) = \min_i g_n(y, i).$$

Step 1: $\underline g_n(y) \le g_n(y) \le \bar g_n(y)$ for all $y \in B^{\mathbb Z_+}$. Since
$$g_n(y) = Q[y_0 \mid y_1^n] = \sum_i P(y_0 \mid y_1^n, X_{n+1} = i) \cdot P(X_{n+1} = i \mid y_1^n),$$
we have
$$g_n(y) \le \bar g_n(y) \cdot \Big(\sum_{i\in A} P(X_{n+1} = i \mid y_1^n)\Big) = \bar g_n(y).$$

Similarly, $\underline g_n(y) \le g_n(y)$. Step 2: One has
$$g_{n+1}(y, i) = \sum_j g_n(y, j)\, w(j, i),$$
where $w(j, i) > 0$ and $\sum_j w(j, i) = 1$ are given by
$$w(j, i) = \frac{p_{ji}\, P[X_{n+1} = j \mid y_0^n]\, P[Y_{n+1} = y_{n+1} \mid X_{n+1} = j]}{\sum_{j'} p_{j'i}\, P[X_{n+1} = j' \mid y_0^n]\, P[Y_{n+1} = y_{n+1} \mid X_{n+1} = j']} = \frac{p_{ji}\, P[X_{n+1} = j \mid y_0^n]\, \Pi_{j y_{n+1}}}{\sum_{j'} p_{j'i}\, P[X_{n+1} = j' \mid y_0^n]\, \Pi_{j' y_{n+1}}}.$$
Let $\underline g_n(y) = g_n(y, j^*)$. Then

$$g_{n+1}(y, i) \le \underline g_n(y) \cdot \min_j w(j, i) + \bar g_n(y) \cdot \big(1 - \min_j w(j, i)\big).$$

Step 3: The inequalities above imply the following key inequalities: for some $\rho \in (0, 1)$,
$$\bar g_{n+1}(y) \le \rho\, \underline g_n(y) + (1 - \rho)\, \bar g_n(y), \qquad \underline g_{n+1}(y) \ge \rho\, \bar g_n(y) + (1 - \rho)\, \underline g_n(y),$$

and hence

$$0 \le \bar g_{n+1}(y) - \underline g_{n+1}(y) \le (1 - 2\rho)\big(\bar g_n(y) - \underline g_n(y)\big) \le (1 - 2\rho)^{n+1}.$$

Note that we need conditions P > 0 and Π > 0 to conclude that ρ ∈ (0, 1). Finally, one has

$$g_{n+1}(y) - g_n(y) \le \bar g_{n+1}(y) - \underline g_n(y) \le (1 - \rho)\big(\bar g_n(y) - \underline g_n(y)\big)$$

and similarly

$$-(1 - \rho)\big(\bar g_n(y) - \underline g_n(y)\big) \le g_{n+1}(y) - g_n(y).$$

Therefore, for every n ≤ m,

$$|g_n(y) - g_m(y)| \le \frac{\theta^n}{1 - \theta}, \qquad \theta = 1 - \rho \in (0, 1).$$

That means that the sequence {gn(·)} converges uniformly, and hence the limiting function g = limn gn is a continuous function, which is also strictly positive (c.f., Exercise 6.4). Moreover, estimates above show that Z varn(g) decays exponentially, and hence Q, viewed as a measure on B , is a Gibbs measure in the DLR sense. Chapter 7

Denoising

We have seen that if $\{Y_n\}$ is a HMC obtained from a Markov chain $\{X_n\}$ with parameters $(\pi, p)$ and emission kernels $\Pi = (\Pi_{i,j})_{i\in A, j\in B}$, then given a finite realization $Y_0^n = y_0^n$ we can easily estimate
$$P(X_t = i \mid Y_0^n = y_0^n), \quad 0 \le t \le n.$$

Reminder:

Lemma 7.1.

$$P(X_t = i \mid Y_0^n = y_0^n) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j\in A} \alpha_t(j)\,\beta_t(j)},$$
where $\{\alpha_t(\cdot), \beta_t(\cdot)\}_{t=0}^n$ are the forward and backward variables (probabilities) given by

$$\alpha_0(i) = \pi(i)\,\Pi_{i y_0}, \qquad \alpha_{t+1}(i) = \sum_{\iota\in A} \alpha_t(\iota)\, P_{\iota i}\, \Pi_{i y_{t+1}}, \quad t = 0, 1, \dots, n-1,$$
and
$$\beta_n(i) = 1, \qquad \beta_t(i) = \sum_{\iota\in A} \beta_{t+1}(\iota)\, P_{i\iota}\, \Pi_{\iota y_{t+1}}, \quad t = 0, 1, \dots, n-1.$$

In fact, for any t,

$$\alpha_t(i) = P[Y_0^t = y_0^t, X_t = i], \qquad \beta_t(i) = P[Y_{t+1}^n = y_{t+1}^n \mid X_t = i].$$

n n Decoding: if we know the distribution P(Xt = ·|Y0 = y0 ), equivalently if n n we know probabilities P(Xt = ·, Y0 = y0 ) we can predict/guess Xbt from n n the observed data Y0 = y0 .


Example: $X_t \in \{0, 1\}$; then the optimal choice is
$$\hat X_t = \begin{cases} 0, & \text{if } \theta_t > 1, \\ 1, & \text{if } \theta_t < 1, \end{cases} \qquad \text{where } \theta_t = \frac{P(X_t = 0, Y_0^n = y_0^n)}{P(X_t = 1, Y_0^n = y_0^n)} = \frac{P(X_t = 0 \mid Y_0^n = y_0^n)}{P(X_t = 1 \mid Y_0^n = y_0^n)}.$$
Predict the most probable outcome. The key to the proof is the Hidden Markov Property

..., Yt−2, Yt−1⊥Xt Yt+1, Yt+2,...

Let me now derive another expression for $P(X_t = i \mid Y_0^n = y_0^n)$, using only the fact that $Y_t$ depends only on $X_t$, and not even using the fact that $\{X_t\}$ is Markov. We want to derive $P(X_t = i \mid Y_0^n = y_0^n)$, or equivalently $P(X_t = i, Y_0^n = y_0^n)$. Consider the unconditional probabilities

$$\begin{aligned} P[X_t = i, Y_0^n = y_0^n] &= P[X_t = i, Y_t = y_t, Y_{n\setminus t} = y_{n\setminus t}] = P[Y_t = y_t, Y_{n\setminus t} = y_{n\setminus t} \mid X_t = i]\, P[X_t = i] \\ &= P[Y_t = y_t \mid X_t = i]\, P[Y_{n\setminus t} = y_{n\setminus t} \mid X_t = i]\, P[X_t = i] \\ &= \Pi_{i y_t}\, P[X_t = i, Y_{n\setminus t} = y_{n\setminus t}], \end{aligned}$$
where $n \setminus t = \{0, \dots, n\} \setminus \{t\}$.

P[X_t = i, Y_t = y_t | Y_{n\t} = y_{n\t}] = Π_{i y_t} P[X_t = i | Y_{n\t} = y_{n\t}].

Let us sum over all possible i: by the law of total probability,

P[Y_t = y_t | Y_{n\t} = y_{n\t}] = Σ_{i} Π_{i y_t} P[X_t = i | Y_{n\t} = y_{n\t}],

or in vector form:

P[Y_t = · | Y_{n\t} = y_{n\t}] = Π^T P[X_t = · | Y_{n\t} = y_{n\t}].

Assume that Π^T is invertible. Then

P[X_t = · | Y_{n\t} = y_{n\t}] = Π^{−T} P[Y_t = · | Y_{n\t} = y_{n\t}].

Hence we obtain the following result.

Proposition 7.2. If {Y_n} is a random function of {X_n}, i.e., if P(Y_t = b | X_t = a) = Π_{ab} for all t, independently, and the square matrix Π is invertible, then

P(X_t = · | Y_{n\t} = y_{n\t}) = (Π^T)^{−1} P[Y_t = · | Y_{n\t} = y_{n\t}]

for any t ∈ {0, 1, . . . , n} and any y_{n\t} ∈ B^n with P(Y_{n\t} = y_{n\t}) > 0. In other words,

P(X_t = · | Y_{n\t} = y_{n\t}) = Π^{−T} P[Y_t = · | Y_{n\t} = y_{n\t}],

and similarly

P(X_t = ·, Y_{n\t} = y_{n\t}) = Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}].   (7.0.1)

But we have also seen

P(X_t = i, Y_t = j, Y_{n\t} = y_{n\t}) = Π_{ij} P[X_t = i, Y_{n\t} = y_{n\t}].   (7.0.2)

Hence, combining (7.0.1) and (7.0.2), and writing Π = [π_1 | . . . | π_M] in terms of its columns,

P(X_t = ·, Y_0^n = y_0^n) = π_{y_t} ⊙ P[X_t = ·, Y_{n\t} = y_{n\t}],

where u ⊙ v = (u_1v_1, . . . , u_Mv_M) is the coordinatewise product of u, v ∈ R^M. Equivalently,

P(X_t = ·, Y_0^n = y_0^n) = π_{y_t} ⊙ Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}].

Proposition 7.3.

P(X_t = ·, Y_0^n = y_0^n) = π_{y_t} ⊙ Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}].

Corollary 7.4.

P[X_t = · | Y_0^n = y_0^n] = (π_{y_t} ⊙ Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}]) / ⟨π_{y_t} ⊙ Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}], 1⟩.
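Proposition 7.3 and Corollary 7.4 can be spot-checked numerically on a toy hidden Markov chain by brute-force enumeration; all parameters below are illustrative assumptions, not taken from the text:

```python
import itertools
import numpy as np

pi0 = np.array([0.5, 0.5])                 # initial distribution of X_0
P = np.array([[0.9, 0.1], [0.3, 0.7]])     # transition matrix of {X_t}
Pi = np.array([[0.8, 0.2], [0.25, 0.75]])  # invertible emission matrix

def joint(x, y):
    """P[X_0^n = x, Y_0^n = y] for the toy chain above."""
    p = pi0[x[0]] * Pi[x[0], y[0]]
    for t in range(1, len(x)):
        p *= P[x[t - 1], x[t]] * Pi[x[t], y[t]]
    return p

y, t = (0, 1, 0, 1), 2
# Left-hand side of Proposition 7.3: P[X_t = ., Y_0^n = y_0^n].
lhs = np.zeros(2)
for x in itertools.product([0, 1], repeat=len(y)):
    lhs[x[t]] += joint(x, y)
# q[b] = P[Y_t = b, Y_{n\t} = y_{n\t}]: sum over both symbols at position t.
q = np.zeros(2)
for b in (0, 1):
    yb = y[:t] + (b,) + y[t + 1:]
    for x in itertools.product([0, 1], repeat=len(y)):
        q[b] += joint(x, yb)
# Right-hand side: pi_{y_t} ⊙ Π^{-T} q, with pi_{y_t} the y_t-th column of Π.
rhs = Pi[:, y[t]] * (np.linalg.inv(Pi.T) @ q)
```

Dividing `rhs` by its sum then reproduces the normalized posterior of Corollary 7.4.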

Remark: The key observation

P(X_t = ·, Y_{n\t} = y_{n\t}) = Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}]

is due to Weissman, Ordentlich, Seroussi, Verdú, Weinberger, IEEE Trans. Inf. Theory 51 (2005).

Denoising Problem

A^{n+1} ∋ X_0^n → CHANNEL → Y_0^n ∈ A^{n+1}

P(Y_k = j | X_k = i) = Π_{ij}, with Π = (Π_{ij}) a square matrix. A denoiser is a function F : A^{n+1} → A^{n+1}, with F(Y_0^n) = X̂_0^n. Loss function: Λ : A × A → [0, +∞). Average cumulative loss:

L_F(X_0^n, y_0^n) = (1/(n + 1)) Σ_{t=0}^n Λ(X_t, F(y_0^n)_t) = (1/(n + 1)) Σ_{t=0}^n Λ(X_t, X̂_t).

Suppose W is a random variable with values in A, distributed according to some probability P. Given also is a loss function Λ : A × A → [0, +∞). Suppose your prediction is ŵ. Expected loss:

Σ_{a∈A} Λ(a, ŵ)P(a) = E_P Λ(W, ŵ).

The optimal prediction (= the one minimizing the expected loss) is

ŵ = arg min_{b∈A} Σ_{a∈A} Λ(a, b)P(a) = arg min_{b∈A} λ_b^T · P⃗,

where, if A = {1, . . . , M}, P⃗ = (P(W = ·)) is an M × 1 column vector, and Λ = [λ_1 | λ_2 | . . . | λ_M], each λ_i being a column of Λ. This is the optimal choice for a single experiment.
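As a sketch (the function name is ours), the rule ŵ = arg min_b λ_b^T · P⃗ is one line:

```python
import numpy as np

def optimal_prediction(Lambda, P):
    """Return arg min_b sum_a Lambda[a, b] * P[a], i.e. the prediction
    minimizing the expected loss; column b of Lambda is lambda_b."""
    return int(np.argmin(Lambda.T @ P))
```

For the 0/1 loss this reduces to the most probable letter, but an asymmetric loss can make it optimal to predict a less probable one.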

Suppose {W_n}_{n≥0} is a realization of some ergodic process with marginal distribution P(·) on A. Then Ŵ = arg min_{b∈A} λ_b^T · P⃗ is still the optimal choice: by the Ergodic Theorem,

(1/(n + 1)) Σ_{t=0}^n Λ(W_t, Ŵ) → E_P Λ(W_0, Ŵ) = min

almost surely. In practice (= the denoising setup), P is not constant: P_t = P[X_t = · | Y_0^n] changes with t, but choosing the optimal X̂_t for every t is still the "right" strategy.

L = L_{X̂} = (1/(n + 1)) Σ_{t=0}^n Λ(X_t, X̂_t(Y_0^n)).

Example: 0/1 loss: Λ(a_1, a_2) = 0 if a_1 = a_2, and 1 if a_1 ≠ a_2. Then

L_F = (1/(n + 1)) |{t : X_t ≠ X̂_t}|.

• X_0^n is unknown to us, Y_0^n is observed. We would like to find X̂_0^n which minimizes L_F.

• L_F is a random variable, depending on X_0^n ∼ P and Y_0^n ∼ Π( · | X_0^n).

– Semi-stochastic setting: X_0^∞ ∈ A^{Z_+} is fixed; randomness/averaging comes from the channel.
– Stochastic setting: average both w.r.t. X and the channel.

(quenched vs. annealed)

Denoiser:

X̂_t(y_0^n) = arg min_{x∈A} λ_x^T · P(X_t = · | Y_0^n = y_0^n),   ∀t = 0, . . . , n.

• NON-UNIVERSAL SETUP: We know the distribution of {X_0^n} and the channel (Π); hence we can compute P(X_t = · | Y_0^n = y_0^n). Example: hidden Markov setup: we know X_0^n ∼ P with parameters (π, P).

• UNIVERSAL SETUP (known channel): We do not know the probability law of {X_0^n}, but we know the channel characteristics (Π).

X̂_t = arg min_{x∈A} λ_x^T P[X_t = · | Y_0^n = y_0^n] = arg min_{x∈A} λ_x^T P[X_t = ·, Y_0^n = y_0^n].

We have shown that, in case Π is invertible,

P[X_t = ·, Y_0^n = y_0^n] = π_{y_t} ⊙ Π^{−T} P[Y_t = ·, Y_{n\t} = y_{n\t}].

Example:

X_t ∈ {0, 1},   Π = [ 1−ε  ε ; ε  1−ε ],   0 < ε < 1/2,

Π^{−1} = (1/(1 − 2ε)) [ 1−ε  −ε ; −ε  1−ε ].

Λ : {0, 1}^2 → R_+ is the Hamming 0/1-loss: Λ(a, ã) = 0 if a = ã, and 1 if a ≠ ã.

X̂_t = 0 if θ_t ≥ 1, and X̂_t = 1 if θ_t < 1, where θ_t = P[X_t = 0, Y_0^n = y_0^n] / P[X_t = 1, Y_0^n = y_0^n].

Suppose y_t = 0, so that π_0 = (1−ε, ε)^T. Then

( P[X_t = 0, Y_0^n = y_0^n], P[X_t = 1, Y_0^n = y_0^n] )^T = (1−ε, ε)^T ⊙ (1/(1 − 2ε)) [ 1−ε  −ε ; −ε  1−ε ] ( P(Y_t = 0, Y_{n\t} = y_{n\t}), P(Y_t = 1, Y_{n\t} = y_{n\t}) )^T.

Let a = P(Y_t = 0 | Y_{n\t} = y_{n\t}), 1 − a = P(Y_t = 1 | Y_{n\t} = y_{n\t}). Then X̂_t = 0 if u_t ≥ v_t, where (dropping the positive factor 1/(1 − 2ε))

( u_t, v_t )^T = (1−ε, ε)^T ⊙ [ 1−ε  −ε ; −ε  1−ε ] ( a, 1−a )^T.

Now

u_t ≥ v_t ⟺ (1 − ε)^2 a − ε(1 − ε)(1 − a) ≥ −ε^2 a + ε(1 − ε)(1 − a)
      ⟺ [ (1 − ε)^2 + ε(1 − ε) + ε^2 + ε(1 − ε) ] a ≥ 2ε(1 − ε)
      ⟺ a ≥ 2ε(1 − ε).

The remaining case y_t = 1 can be treated similarly, and we obtain the following denoiser:

X̂_t = y_t if P(Y_t = y_t | Y_{n\t} = y_{n\t}) ≥ 2ε(1 − ε),
X̂_t = 1 − y_t if P(Y_t = y_t | Y_{n\t} = y_{n\t}) < 2ε(1 − ε).
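The equivalence u_t ≥ v_t ⟺ a ≥ 2ε(1 − ε) derived above can be spot-checked numerically (the helper below is ours; it reproduces the y_t = 0 case):

```python
import numpy as np

def bsc_decision(eps, a):
    """X̂_t for observed y_t = 0, where a = P(Y_t = 0 | Y_{n\t} = y_{n\t}).
    Computes (u_t, v_t) via the matrix computation above and returns
    0 iff u_t >= v_t."""
    Pi_invT = np.array([[1 - eps, -eps], [-eps, 1 - eps]]) / (1 - 2 * eps)
    col0 = np.array([1 - eps, eps])              # column pi_0 of Pi
    u, v = col0 * (Pi_invT @ np.array([a, 1 - a]))
    return 0 if u >= v else 1
```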

Problem: How do you obtain

P(Yt = · | Yn\t = yn\t)

from a single realization/observation Y_0^n = y_0^n?

DUDE: Discrete Universal Denoiser

P(Y_t = y_t | Y_{n\t} = y_{n\t}) = P(Y_t = y_t | Y_0^{t−1} = y_0^{t−1}, Y_{t+1}^n = y_{t+1}^n) ≅ P(Y_t = y_t | Y_{t−k}^{t−1} = y_{t−k}^{t−1}, Y_{t+1}^{t+k} = y_{t+1}^{t+k}).

Use these probabilities in denoising.

Remark: "≅" is essentially a GIBBSIAN idea! Conditional probabilities

P(Y_t = y_t | Y_{t−k}^{t−1} = y_{t−k}^{t−1}, Y_{t+1}^{t+k} = y_{t+1}^{t+k})

are easy to estimate: for a given left context a_{t−k}^{t−1}, right context b_{t+1}^{t+k}, and any letter c_t, count the occurrences m(a_{t−k}^{t−1}, c_t, b_{t+1}^{t+k}) of the word a_{t−k}^{t−1} c_t b_{t+1}^{t+k} in Y_0^n, and put

P̂(Y_t = c_t | Y_{t−k}^{t−1} = a_{t−k}^{t−1}, Y_{t+1}^{t+k} = b_{t+1}^{t+k}) = m(a_{t−k}^{t−1}, c_t, b_{t+1}^{t+k}) / Σ_{c̄_t∈A} m(a_{t−k}^{t−1}, c̄_t, b_{t+1}^{t+k}).
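Combining these counts with the 2ε(1 − ε) threshold of the earlier binary example gives a minimal sketch of the DUDE for a binary symmetric channel and Hamming loss (our own toy implementation, not the paper's; boundary positions are simply copied from y):

```python
from collections import defaultdict

def dude_binary(y, eps, k):
    """Sketch of the binary DUDE for a BSC(eps) under Hamming loss.
    Estimates P(Y_t = y_t | two-sided k-context) by counting, then keeps
    y_t iff that estimate is at least 2*eps*(1-eps), flipping it otherwise."""
    n = len(y)
    counts = defaultdict(lambda: [0, 0])
    # First pass: count letters within each two-sided context.
    for t in range(k, n - k):
        ctx = (tuple(y[t - k:t]), tuple(y[t + 1:t + k + 1]))
        counts[ctx][y[t]] += 1
    # Second pass: apply the threshold rule position by position.
    thresh = 2 * eps * (1 - eps)
    xhat = list(y)
    for t in range(k, n - k):
        ctx = (tuple(y[t - k:t]), tuple(y[t + 1:t + k + 1]))
        m = counts[ctx]
        p = m[y[t]] / (m[0] + m[1])
        xhat[t] = y[t] if p >= thresh else 1 - y[t]
    return xhat
```

On a long, mostly-constant observed sequence with a few isolated flips, the context counts make the flipped symbols rare within their contexts, so the rule flips them back.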

Take k = ⌈c log_{|A|} n⌉, c < 1/2. DUDE works:

SEMI-STOCHASTIC SETTING: X = (X_0, X_1, . . . ) ∈ A^{Z_+} is fixed.

D_k(x_0^n, z_0^n) = min_{X̂^n ∈ S_k} L_{X̂^n}(x_{k+1}^{n−k}, z_0^n) = min_{f : A^{2k+1} → A} (1/(n − 2k)) Σ_{t=k+1}^{n−k} Λ(x_t, f(z_{t−k}^{t+k})).

Theorem 7.5. Provided k_n|A|^{2k_n} = O(n/ log n):

• lim_{n→∞} [ L_{X̂_DUDE}(X_0^n, Y_0^n) − D_{k_n}(X_0^n, Y_0^n) ] = 0 (a.s.);

• P[ L_{X̂_DUDE}(X_0^n, Y_0^n) − D_{k_n}(X_0^n, Y_0^n) > δ ] ≤ C_1(k + 1)|A|^{2k+1} exp( −(n − 2k)δ^2 / (C_2(k + 1)|A|^{2k}) ).

STOCHASTIC SETTING: both the input X_0^n and the output Y_0^n are random.

• D(P_{X^n}, Π) = min_{X̂^n ∈ D_n} E[ L_{X̂^n}(X_0^n, Y_0^n) ].

• For stationary processes {X_n}: D(P_X, Π) = lim_{n→∞} D(P_{X^n}, Π) = inf_{n≥1} D(P_{X^n}, Π).

Theorem 7.6. Provided k_n|A|^{2k_n} = O(n):

lim_{n→∞} E L_{X̂_DUDE}(X_0^n, Y_0^n) = D(P_X, Π).

Remark: The idea of the approximation

P(Y_t = · | Y_{n\t} = y_{n\t}) ≅ P(Y_t = · | Y_{t−k}^{t−1} = y_{t−k}^{t−1}, Y_{t+1}^{t+k} = y_{t+1}^{t+k})

is similar (≅ identical) to approximating

P(Y_t = · | Y_{−∞}^{t−1})   or   P(Y_t = · | Y_0^{t−1} = y_0^{t−1})

by

P(Y_t = · | Y_{t−k}^{t−1} = y_{t−k}^{t−1}),

i.e., to saying that {Y_t} is essentially a k-step Markov chain.

Fact: If {Y_t}_{t∈Z} is a stationary process, consider its k-step Markov approximation {Ŷ_t}, defined by

P̂[Ŷ_t = y_t | Ŷ_{t−k}^{t−1} = y_{t−k}^{t−1}] = P[Y_t = y_t | Y_{t−k}^{t−1} = y_{t−k}^{t−1}];

then P̂ → P as k → ∞. However, it is known that we can do "better": e.g., we can achieve the same quality of approximation with fewer parameters. Information theory has developed various efficient algorithms, e.g., for estimating Variable Length Markov Models from data.

The natural question, namely whether we can do "better" in the two-sided context, i.e., what are good models and methods for constructing appropriate two-sided approximations, was raised in information theory.

However, it turns out that it is not "easy" to extend existing efficient one-sided algorithms to the two-sided case.

Appendix A

Prerequisites in Measure Theory, Probability, and Ergodic Theory

Last modified on April 3, 2015

We start by briefly recalling basic notions and facts which will be used in subsequent chapters.

A.1 Notation

The sets of natural numbers, integers, non-negative integers, and real num- bers will be denoted by N, Z, Z+, R, respectively. The set of extended real numbers R is R ∪ {±∞}. If Λ and V are some non-empty sets, by VΛ we will denote the space of V-valued functions defined on Λ:

VΛ = { f : Λ → V}.

If Π ⊆ Λ and f ∈ V^Λ, by f_Π we will denote the restriction of f to Π, i.e., f_Π ∈ V^Π. If Π_1 and Π_2 are disjoint, f ∈ V^{Π_1}, and g ∈ V^{Π_2}, by f_{Π_1} g_{Π_2} we will denote the unique element h ∈ V^{Π_1∪Π_2} such that h_{Π_1} = f and h_{Π_2} = g.

A.2 Measure theory

A.2.1. Suppose Ω is some non-empty set. A collection A of subsets of Ω is called a σ-algebra if it satisfies the following properties:

(i) Ω ∈ A;


(ii) if A ∈ A, then Ac = Ω \ A ∈ A;

(iii) for any sequence {A_n}_{n∈N} with A_n ∈ A for every n ∈ N, one has ∪_n A_n ∈ A.

Equivalently, a σ-algebra A is a collection of subsets of Ω closed under countable operations such as intersection, union, complement, and symmetric difference. An element of A is called a measurable subset of Ω. An intersection of two σ-algebras is again a σ-algebra. This property allows us, for a given collection C of subsets of Ω, to define uniquely the minimal σ-algebra σ(C) containing C, as the intersection of all σ-algebras containing C. A pair (Ω, A), where A is some σ-algebra of subsets of Ω, is called a measurable space. If the space Ω is metrizable (e.g., Ω = [0, 1] or R), one can define the Borel σ-algebra of Ω, denoted by B(Ω), as the minimal σ-algebra containing all open subsets of Ω.

A.2.2. Suppose (Ω1, A1) and (Ω2, A2) are two measurable spaces. The map T : Ω1 → Ω2 is called (A1, A2)-measurable if for any A2 ∈ A2, the full preimage of A2 is a measurable subset of Ω1, i.e., −1 T A2 = {ω ∈ Ω1 : T(ω) ∈ A2} ∈ A1.

By definition, a random variable is a measurable map from (Ω1, A1) into (R, B(R)). If (Ω, A) is a measurable space, and T : Ω → Ω is a measurable map, then T is called a measurable transformation of (Ω, A) or an endo- morphism of (Ω, A). If, furthermore, T is invertible and T−1 : Ω → Ω is also measurable, then T is called a measurable isomorphism of (Ω, A).

A.2.3. Suppose (Ω, A) is a measurable space. The function µ : A → [0, +∞] is called a measure if

• µ(∅) = 0;

• for any countable collection of pairwise disjoint sets {A_n}_n, A_n ∈ A, one has

µ(∪_n A_n) = Σ_n µ(A_n).   (A.2.1)

The triple (Ω, A, µ) is called a measure space. If µ(Ω) = 1, then µ is called a probability measure, and (Ω, A, µ) a probability (measure) space.

A.2.4. Suppose (Ω, A, µ) is a measure space. Then for any A ∈ A, the indicator function of A is

I_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A.

The Lebesgue integral of f = IA is defined as Z IA dµ = µ(A).

A function f is called simple if there exist K ∈ N, measurable sets {A_k}_{k=1}^K with A_k ∈ A, and non-negative numbers {α_k}_{k=1}^K such that

f(ω) = Σ_{k=1}^K α_k I_{A_k}(ω).

The Lebesgue integral of the simple function f is defined as

∫ f dµ = Σ_{k=1}^K α_k µ(A_k).

Suppose { fn} is a monotonically increasing sequence of simple functions and f = limn fn. Then the Lebesgue integral of f is defined as Z Z f dµ = lim fn dµ. n

It turns out that every non-negative measurable function f : Ω → R+ can be represented as a limit of increasing sequence of simple functions.

For a measurable function f : Ω → R, let f_+ and f_− be the positive and negative parts of f, respectively, i.e., f = f_+ − f_−. The function f is called Lebesgue integrable if

∫ f_+ dµ, ∫ f_− dµ < +∞,

and the Lebesgue integral of f is then defined as

∫ f dµ = ∫ f_+ dµ − ∫ f_− dµ.

The set of all Lebesgue integrable functions on (Ω, A, µ) will be denoted by L^1(Ω, A, µ), or by L^1(Ω) and L^1(µ) if this does not lead to confusion. If (Ω, A, µ) is a probability space, we will sometimes denote the Lebesgue integral ∫ f dµ by E_µ f or E f.

A.2.5. Conditional expectation with respect to a sub-σ-algebra. Suppose (Ω, A, µ) is a probability space, and F is some sub-σ-algebra of A. Suppose f ∈ L^1(Ω, A, µ) is an A-measurable Lebesgue integrable function. The conditional expectation of f given F is denoted by E(f|F), and is by definition an F-measurable function on Ω such that

∫_C E(f|F) dµ = ∫_C f dµ

for every C ∈ F.
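On a finite probability space with F generated by a partition, E(f | F) is simply the µ-weighted average of f over each atom; a small sketch (all names ours) verifying the defining property:

```python
from fractions import Fraction

# Toy finite probability space: uniform measure on six points.
omega = [0, 1, 2, 3, 4, 5]
mu = {w: Fraction(1, 6) for w in omega}
f = {w: w * w for w in omega}
partition = [[0, 1], [2, 3, 4], [5]]          # atoms generating F

def cond_exp(f, mu, partition):
    """E(f | F): constant on each atom, equal to the mu-average of f there."""
    g = {}
    for atom in partition:
        mass = sum(mu[w] for w in atom)
        avg = sum(f[w] * mu[w] for w in atom) / mass
        for w in atom:
            g[w] = avg
    return g
```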

A.2.6. Martingale convergence theorem.

Suppose (Ω, A, µ) is a measure space, and {An} is a sequence of sub-σ- algebras of A such that An ⊆ An+1 for all n. Denote by A∞ the minimal 1 σ-algebra containing all An. Then for any f ∈ L (Ω, A, µ)

Eµ( f |An) → Eµ( f |A∞) µ-almost surely and in L1(Ω).

A.2.7. Absolutely continuous measures and the Radon-Nikodym Theorem. Suppose ν and µ are two measures on the measurable space (Ω, A). The measure ν is absolutely continuous with respect to µ (denoted by ν  µ) if ν(A) = 0 for all A ∈ A such that µ(A) = 0. The Radon-Nikodym theorem states that if µ and ν are two σ-finite mea- sures on (Ω, A), and ν  µ, then there exists a non-negative measurable function f , called the (Radon-Nikodym) density of ν with respect to µ, such that Z ν(A) = f dµ, for all A ∈ A. A

A.3 Stochastic processes

Suppose (Ω, A, µ) is a probability space. A stochastic process {X_n} is a collection of random variables X_n : Ω → R, indexed by n ∈ T, where the time set is T = Z_+ or Z. A stochastic process can be described by its finite-dimensional distributions (marginals): for every (n_1, . . . , n_k) ∈ T^k and Borel sets A_{n_1}, . . . , A_{n_k} ⊂ R, put

µ_{n_1,...,n_k}(A_{n_1}, . . . , A_{n_k}) := µ({ω ∈ Ω : X_{n_1}(ω) ∈ A_{n_1}, . . . , X_{n_k}(ω) ∈ A_{n_k}}).

In the opposite direction, a consistent family of finite-dimensional distributions can be used to define a stochastic process.

The process {X_n} is called stationary if for all (n_1, . . . , n_k) ∈ T^k, t ∈ Z_+, and Borel sets A_{n_1}, . . . , A_{n_k} ⊆ R one has

µ_{n_1+t,...,n_k+t}(A_{n_1}, . . . , A_{n_k}) = µ_{n_1,...,n_k}(A_{n_1}, . . . , A_{n_k}),

i.e., the time-shift does not affect the finite-dimensional marginal distributions.

A.4 Ergodic theory

Ergodic Theory originates from the Boltzmann-Maxwell ergodic hypothe- sis and is the study of measure preserving dynamical systems (Ω, A, µ, T) where

• (Ω, A, µ) is a (Lebesgue) probability space

• T : Ω → Ω is measure preserving: for all A ∈ A,

µ(T^{−1}A) = µ({ω ∈ Ω : T(ω) ∈ A}) = µ(A).

Example 1. Let Ω = T^1 = R/Z ≅ [0, 1), µ = Lebesgue measure,

• Circle rotation: T_α(x) = x + α mod 1,
• Doubling map: T(x) = 2x mod 1.

Example 2. Consider a finite set (alphabet) A = {1, . . . , N}, and put

Ω = A^{Z_+} = {ω = (ω_n)_{n≥0} : ω_n ∈ A}.

The corresponding Borel σ-algebra A is generated by the cylinder sets

[a_{i_1}, . . . , a_{i_n}] = {ω : ω_{i_1} = a_{i_1}, . . . , ω_{i_n} = a_{i_n}},   a_{i_k} ∈ A.

A measure p = (p(1), . . . , p(N)) on A can be extended to a measure µ = p^{Z_+} on (Ω, A):

µ({ω : ω_{i_1} = a_{i_1}, . . . , ω_{i_n} = a_{i_n}}) = p(a_{i_1}) · · · p(a_{i_n}).

The measure µ is clearly preserved by the left shift σ : Ω → Ω,

(σω)_n = ω_{n+1}   ∀n.

The quadruple (Ω, A, µ, σ) is called the p-Bernoulli shift.

Definition A.1. A measure-preserving dynamical system (Ω, A, µ, T) is ergodic if every invariant set is trivial:

A = T^{−1}A ⇒ µ(A) = 0 or 1,

equivalently, if every invariant function is constant:

f(ω) = f(T(ω)) (µ-a.e.) ⇒ f(ω) = const (µ-a.e.).

Theorem A.2 (Birkhoff's Pointwise Ergodic Theorem). Suppose (Ω, A, µ, T) is an ergodic measure-preserving dynamical system. Then for all f ∈ L^1(Ω, µ),

(1/n) Σ_{t=0}^{n−1} f(T^t(ω)) → ∫ f(ω) µ(dω)   as n → ∞,

µ-almost surely and in L^1.
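For instance, the circle rotation of Example 1 with irrational α is ergodic for Lebesgue measure, so orbit averages of f(x) = x should approach ∫_0^1 x dx = 1/2; a quick numerical sketch (function name ours):

```python
import math

def birkhoff_average(alpha, x0, n):
    """Average of f(x) = x along the orbit of T_alpha(x) = x + alpha mod 1."""
    s, x = 0.0, x0
    for _ in range(n):
        s += x
        x = (x + alpha) % 1.0
    return s / n
```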

A.5 Entropy

Suppose p = (p_1, . . . , p_N) is a probability vector, i.e.,

p_i ≥ 0,   Σ_{i=1}^N p_i = 1.

The entropy of p is

H(p) = − Σ_{i=1}^N p_i log_2 p_i.

A.5.1 Shannon’s entropy rate per symbol

Definition A.3. Suppose Y = {Y_k} is a stationary process with values in a finite alphabet A, and µ is the corresponding translation-invariant measure. Fix n ∈ N, and consider the distribution of n-tuples (Y_0, . . . , Y_{n−1}) ∈ A^n. Denote by H_n the entropy of this distribution:

H_n = H(Y_0^{n−1}) = − Σ_{a_0^{n−1} ∈ A^n} P[Y_0^{n−1} = a_0^{n−1}] log_2 P[Y_0^{n−1} = a_0^{n−1}].

The entropy (rate) of the process Y = {Y_k}, equivalently of the measure P, denoted by h(Y), h(P), or h_σ(P), is defined as

h(Y) = h(P) = h_σ(P) = lim_{n→∞} (1/n) H_n

(the limit exists!).
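For a stationary Markov chain the chain rule gives H_n = H_1 + (n − 1)H(Y_1 | Y_0) exactly, so H_n/n converges to the conditional entropy H(Y_1 | Y_0); a small numerical check (parameters ours):

```python
import itertools
import math

p = [[0.9, 0.1], [0.4, 0.6]]   # transition probabilities of a 2-state chain
pi = [0.8, 0.2]                # stationary distribution: pi P = pi

def H_n(n):
    """Entropy of the n-tuple (Y_0, ..., Y_{n-1}), by brute-force enumeration."""
    h = 0.0
    for a in itertools.product((0, 1), repeat=n):
        prob = pi[a[0]]
        for t in range(1, n):
            prob *= p[a[t - 1]][a[t]]
        h -= prob * math.log2(prob)
    return h

# Entropy rate h = H(Y_1 | Y_0) = -sum_i pi_i sum_j p_ij log2 p_ij.
rate = -sum(pi[i] * p[i][j] * math.log2(p[i][j]) for i in (0, 1) for j in (0, 1))
```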

Similarly, the entropy can be defined for stationary random fields {Y_n}_{n∈Z^d}, Y_n ∈ A. Let Λ_n = [0, n − 1]^d ∩ Z^d. Then

h(Y) = lim_{n→∞} −(1/|Λ_n|) Σ_{a_{Λ_n} ∈ A^{Λ_n}} P[Y_{Λ_n} = a_{Λ_n}] log_2 P[Y_{Λ_n} = a_{Λ_n}];

again, one can easily show that the limit exists.

A.5.2 Kolmogorov-Sinai entropy of measure-preserving systems

Suppose (Ω, A, µ, T) is a measure-preserving dynamical system. Suppose

C = {C1,..., CN} is a finite measurable partition of Ω, i.e., a partition of Ω into measurable sets Ck ∈ A. For every ω ∈ Ω and n ∈ Z+ (or Z), put

Y_n(ω) = Y_n^C(ω) = j ∈ {1, . . . , N} ⟺ T^n(ω) ∈ C_j.

Proposition: For any C, the corresponding process Y^C = {Y_n}, Y_n : Ω → {1, . . . , N}, is a stationary process with

P[Y_0 = j_0, . . . , Y_n = j_n] = µ({ω ∈ Ω : ω ∈ C_{j_0}, . . . , T^n(ω) ∈ C_{j_n}}).

Definition A.4. If (Ω, A, µ, T) is a measure-preserving dynamical system, and C = {C_1, . . . , C_N} is a finite measurable partition of Ω, then the entropy of (Ω, A, µ, T) with respect to C is defined as the Shannon entropy of the corresponding symbolic process Y^C:

h_µ(T, C) = h(Y^C).

Finally, the measure-theoretic, or Kolmogorov-Sinai, entropy of (Ω, A, µ, T) is defined as

h_µ(T) = sup_{C finite} h_µ(T, C).

The following theorem of Sinai eliminates the need to consider all finite partitions.

Definition A.5. A partition C is called generating (or a generator) for the dynamical system (Ω, A, µ, T) if the smallest σ-algebra containing the sets T^n(C_j), j = 1, . . . , N, n ∈ Z, is A.

Theorem A.6 (Ya. Sinai). If C is a generating partition, then

h_µ(T) = h_µ(T, C).
