
MA427 Ergodic Theory

Course Notes (2012-13)

1 Introduction

1.1 Orbits

Let X be a mathematical space. For example, X could be the unit interval [0, 1], a circle, a torus, or something far more complicated like a Cantor set. Let T : X → X be a function that maps X into itself. Let x ∈ X be a point. We can repeatedly apply the map T to the point x to obtain the sequence {x, T(x), T(T(x)), T(T(T(x))), ...}. We will often write T^n(x) = T(···(T(T(x)))) (n times). The sequence of points

x, T(x), T^2(x), ... is called the orbit of x. We think of applying the map T as the passage of time. Thus we think of T(x) as where the point x has moved to after time 1, T^2(x) as where it has moved to after time 2, etc. Some points x ∈ X return to where they started: that is, T^n(x) = x for some n ≥ 1. We say that such a point x is periodic with period n. By way of contrast, points may move densely around the space X. (A sequence is said to be dense if (loosely speaking) it comes arbitrarily close to every point of X.) If we take two points x, y of X that start very close together then their orbits will initially be close. However, it often happens that in the long term their orbits move apart and indeed become dramatically different. This is known as sensitive dependence on initial conditions, and is popularly known as chaos. In general, for a given T it is impossible to understand the behaviour of every orbit. Ergodic theory takes a more qualitative approach: we aim to describe the long term behaviour of a typical orbit, at least in the case when T satisfies a technical condition called 'measure-preserving'. To make the notion of 'typical' precise, we need to use measure theory. Roughly speaking, a measure is a function that assigns a 'size' to a given subset of X. One of the simplest measures is Lebesgue measure on [0, 1]; here the measure of an interval [a, b] ⊂ [0, 1] is just its length b − a.
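Orbits are easy to experiment with on a computer. The following sketch (not part of the original notes) iterates a map and records the resulting orbit; as an example it uses the doubling map T(x) = 2x mod 1, introduced in §1.2 below, applied to the periodic point 7/15 of Exercise 1.1. Exact rational arithmetic keeps the iteration free of rounding error.

```python
from fractions import Fraction

def orbit(T, x, n):
    """Return the first n points x, T(x), ..., T^{n-1}(x) of the orbit of x."""
    points = []
    for _ in range(n):
        points.append(x)
        x = T(x)
    return points

def doubling(x):
    # the doubling map T(x) = 2x mod 1 of Section 1.2
    return (2 * x) % 1

# 7/15 is periodic: its orbit cycles 7/15, 14/15, 13/15, 11/15, 7/15, ...
print(orbit(doubling, Fraction(7, 15), 5))
```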


Let T : [0, 1] → [0, 1] be a map and fix a subinterval [a, b] ⊂ [0, 1]. Let x ∈ [0, 1]. What is the frequency with which the orbit of x hits the set [a, b]? Recall that the characteristic function

χ_A of a subset A is defined by

    χ_A(x) = 1 if x ∈ A,   χ_A(x) = 0 if x ∉ A.

Then the number of times the first n points of the orbit of x hit [a, b] is given by

    Σ_{j=0}^{n−1} χ_[a,b](T^j(x)).

Thus the proportion of the first n points in the orbit of x that lie in [a, b] is equal to

    (1/n) Σ_{j=0}^{n−1} χ_[a,b](T^j(x)).

Hence the frequency with which the orbit of x lies in [a, b] is given by

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](T^j(x))

(assuming of course that this limit exists!). One of the main results of the course, namely Birkhoff's ergodic theorem, tells us that when T is ergodic (a technical, albeit important, property that we won't define here) then for 'most' orbits the above frequency is equal to the measure of the interval [a, b]. In the case of Lebesgue measure, this means that:

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](T^j(x)) = b − a,  for almost all x ∈ X.

(Here 'almost all' is the technical measure-theoretic way of saying 'most'.) One way of looking at Birkhoff's ergodic theorem is the following: the time average of a typical point x ∈ X (i.e. the frequency with which its orbit lands in a given subset) is equal to the space average (namely, the measure of that subset). In this course, we develop the necessary background that builds up to Birkhoff's ergodic theorem, together with some illuminating examples. We also study en route some interesting diversions to other areas of mathematics, notably number theory.
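The 'time average equals space average' statement can be tested numerically (a sketch, not a proof). We use the doubling map T(x) = 2x mod 1 of the next subsection and the interval [1/4, 1/2): applying T deletes the first binary digit of x, so T^j(x) ∈ [1/4, 1/2) exactly when the binary expansion of x reads 01 from place j. Floating point is useless here (each application of T discards a bit), so the sketch computes the first 20000 binary digits of √2 exactly with integer arithmetic.

```python
from math import isqrt

D = 20000
# Binary digits of sqrt(2) after the leading "1.": floor(sqrt(2) * 2^D) in binary.
digits = bin(isqrt(2 ** (2 * D + 1)))[3:]

# T^j(x) lies in [1/4, 1/2) exactly when digits j and j+1 of x read "01".
n = D - 2
hits = sum(1 for j in range(n) if digits[j] == '0' and digits[j + 1] == '1')
print(hits / n)  # close to 1/4, the Lebesgue measure of [1/4, 1/2)
```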

1.2 Introducing the doubling map

Let X = [0, 1] denote the unit interval. Define the map T : X → X by

    T(x) = 2x mod 1 = { 2x      if 0 ≤ x < 1/2,
                        2x − 1  if 1/2 ≤ x ≤ 1

('mod 1' stands for 'modulo 1' and means 'ignore the integer part'; for example 3.456 mod 1 is 0.456).


Exercise 1.1. Draw the graph of the doubling map. By sketching the orbit in this graph, indicate that 7/15 is periodic. Try sketching the orbits of some points near 7/15.

In §1.1 we mentioned in passing that we will be interested in a technical condition called 'measure-preserving'. We can illustrate this property here. Fix an interval [a, b] and consider the set T^{−1}[a, b] = {x ∈ [0, 1] | T(x) ∈ [a, b]}. One can easily check that

    T^{−1}[a, b] = [a/2, b/2] ∪ [(a + 1)/2, (b + 1)/2],

so that T^{−1}[a, b] is the union of two intervals, each of length (b − a)/2. Hence the length of T^{−1}[a, b] is equal to b − a, which is the same as the length of [a, b].
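This preservation of length can be checked numerically (a sketch): sample points x on a fine grid and count how often T(x) lands in [a, b]; the fraction approximates the length of T^{−1}[a, b], which the calculation above shows equals b − a. The interval [1/3, 3/4] is an arbitrary choice.

```python
def T(x):
    # the doubling map, applied once (a single application is safe in floats)
    return (2.0 * x) % 1.0

a, b = 1 / 3, 3 / 4
N = 200000
count = sum(1 for i in range(N) if a <= T(i / N) <= b)
print(count / N)  # approximately b - a = 5/12
```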

1.3 Leading digits

The leading digit of a number n ∈ N is the digit (between 1 and 9) that appears at the leftmost end of n when n is written in base 10. Thus, the leading digit of 4629 is 4, etc. Consider the sequence 2^n:

1, 2, 4, 8, 16, 32, 64, 128,...

and consider the sequence of leading digits:

1, 2, 4, 8, 1, 3, 6, 1,....

Exercise 1.2. By writing down the sequence of leading digits of 2^n for n = 1, 2, ..., up to something large of your choosing, try to guess the frequency with which the digit 1 appears as a leading digit. (Hint: it isn't 3/10.) Do the same for the digit 2. Can you guess the frequency with which the digit r appears? We will study this problem in greater detail later.
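Python's arbitrary-precision integers make Exercise 1.2 easy to explore empirically (a sketch only; it does not give away the closed form):

```python
from collections import Counter

# Count leading digits of 2^n for n = 1, ..., 10000 using exact integers.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 10001))
for r in range(1, 10):
    print(r, counts[r] / 10000)
```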


2 Uniform Distribution

2.1 Uniform distribution and Weyl's criterion

Before we discuss dynamical systems in greater detail, we shall consider a simpler setting which highlights some of the main ideas in ergodic theory.

Let x_n be a sequence of real numbers. We may decompose x_n as the sum of its integer part [x_n] = sup{m ∈ Z | m ≤ x_n} (i.e. the largest integer which is less than or equal to x_n) and its fractional part {x_n} = x_n − [x_n]. Clearly, 0 ≤ {x_n} < 1. The study of x_n mod 1 is the study of the sequence {x_n} in [0, 1).

Definition 2.1. We say that the sequence x_n is uniformly distributed mod 1 if for every a, b with 0 ≤ a < b < 1, we have that

    (1/n) card{j | 0 ≤ j ≤ n − 1, {x_j} ∈ [a, b]} → b − a,  as n → ∞.

(The condition is saying that the proportion of the sequence {x_n} lying in [a, b] converges to b − a, the length of the interval.)

Remark 2.2. We can replace [a, b] by [a, b), (a, b] or (a, b) with the same result.

Exercise 2.3. Show that if x_n is uniformly distributed mod 1 then {x_n} is dense in [0, 1).

The following result gives a necessary and sufficient condition for x_n to be uniformly distributed mod 1.

Theorem 2.4 (Weyl’s Criterion). The following are equivalent:

(i) the sequence x_n is uniformly distributed mod 1;

(ii) for each ℓ ∈ Z \ {0}, we have

    (1/n) Σ_{j=0}^{n−1} e^{2πiℓx_j} → 0

as n → ∞.

2.2 The sequence x_n = nα

The behaviour of the sequence x_n = nα depends on whether α is rational or irrational. If α ∈ Q, it is easy to see that {nα} can take on only finitely many values in [0, 1): if α = p/q (p ∈ Z, q ∈ N, hcf(p, q) = 1) then {nα} takes the q values

    0, {p/q}, {2p/q}, ..., {(q − 1)p/q}.

In particular, {nα} is not uniformly distributed mod 1.


If α ∈ R \ Q then the situation is completely different. We shall apply Weyl's Criterion. For ℓ ∈ Z \ {0}, e^{2πiℓα} ≠ 1, so we have

    (1/n) Σ_{j=0}^{n−1} e^{2πiℓjα} = (1/n) (e^{2πiℓnα} − 1)/(e^{2πiℓα} − 1).

Hence

    |(1/n) Σ_{j=0}^{n−1} e^{2πiℓjα}| ≤ (1/n) · 2/|e^{2πiℓα} − 1| → 0,  as n → ∞.

Hence nα is uniformly distributed mod 1.
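Weyl's criterion is easy to test numerically. The sketch below computes the averaged exponential sums for α = √2 and ℓ = 1; the magnitudes decay like 1/n, in line with the bound above.

```python
import cmath
import math

alpha = math.sqrt(2)  # an irrational choice of alpha
ell = 1
for n in (100, 1000, 10000):
    # |1/n sum_{j<n} exp(2 pi i l j alpha)| should be at most 2/(n |e^{2 pi i l a} - 1|)
    S = sum(cmath.exp(2j * math.pi * ell * j * alpha) for j in range(n)) / n
    print(n, abs(S))
```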

Remarks 2.5. (i) More generally, we could consider the sequence x_n = nα + β. It is easy to see by modifying the above arguments that x_n is uniformly distributed mod 1 if and only if α is irrational.

(ii) Fix α > 1 and consider the sequence x_n = α^n x, for some x ∈ (0, 1). Then it is possible to show that for almost every x, the sequence x_n is uniformly distributed mod 1. We will prove this later in the course, at least in the cases when α = 2, 3, 4, ....

(iii) Suppose in the above remark we fix x = 1 and consider the sequence x_n = α^n. Then one can show that x_n is uniformly distributed mod 1 for almost all α > 1. However, not a single example of such an α is known!

Exercise 2.6. Calculate the frequency with which 2^n has r (r = 1, ..., 9) as the leading digit of its base 10 representation. (You may assume that log_10 2 is irrational.) (Hint: first show that 2^n has leading digit r if and only if

    r · 10^ℓ ≤ 2^n < (r + 1) · 10^ℓ  for some ℓ ∈ Z^+.)

Exercise 2.7. Calculate the frequency with which 2^n has r (r = 0, 1, ..., 9) as the second digit of its base 10 representation.

2.3 Proof of Weyl’s criterion

Proof. Since e^{2πix_j} = e^{2πi{x_j}}, we may suppose, without loss of generality, that x_j = {x_j}.

(i) ⇒ (ii): Suppose that x_j is uniformly distributed mod 1. If χ_[a,b] is the characteristic function of the interval [a, b], then we may rewrite the definition of uniform distribution in the form

    (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) → ∫_0^1 χ_[a,b](x) dx,  as n → ∞.

From this we deduce that

    (1/n) Σ_{j=0}^{n−1} f(x_j) → ∫_0^1 f(x) dx,  as n → ∞,


whenever f is a step function, i.e., a linear combination of characteristic functions of intervals. Now let g be a continuous function on [0, 1] (with g(0) = g(1)). Then, given ε > 0, we

can find a step function f with ‖g − f‖_∞ ≤ ε. We have the estimate

    |(1/n) Σ_{j=0}^{n−1} g(x_j) − ∫_0^1 g(x) dx|
        ≤ |(1/n) Σ_{j=0}^{n−1} (g(x_j) − f(x_j))| + |(1/n) Σ_{j=0}^{n−1} f(x_j) − ∫_0^1 f(x) dx| + |∫_0^1 f(x) dx − ∫_0^1 g(x) dx|
        ≤ 2ε + |(1/n) Σ_{j=0}^{n−1} f(x_j) − ∫_0^1 f(x) dx|.

Since the last term converges to zero, we thus obtain

    lim sup_{n→∞} |(1/n) Σ_{j=0}^{n−1} g(x_j) − ∫_0^1 g(x) dx| ≤ 2ε.

Since ε > 0 is arbitrary, this gives us that

    (1/n) Σ_{j=0}^{n−1} g(x_j) → ∫_0^1 g(x) dx,

as n → ∞, and this holds, in particular, for g(x) = e^{2πiℓx}. If ℓ ≠ 0 then

    ∫_0^1 e^{2πiℓx} dx = 0,

so the first implication is proved.

(ii) ⇒ (i): Suppose now that Weyl's Criterion holds. Then

    (1/n) Σ_{j=0}^{n−1} g(x_j) → ∫_0^1 g(x) dx,  as n → ∞,

whenever g(x) = Σ_{k=1}^{m} α_k e^{2πiℓ_k x} is a trigonometric polynomial. Let f be any continuous function on [0, 1] with f(0) = f(1). Given ε > 0 we can find a trigonometric polynomial g such that ‖f − g‖_∞ ≤ ε. (This is a consequence of Fejér's Theorem.) As in the first part of the proof, we can conclude that

    (1/n) Σ_{j=0}^{n−1} f(x_j) → ∫_0^1 f(x) dx,  as n → ∞.


Now consider the interval [a, b] ⊂ [0, 1). Given ε > 0, we can find continuous functions f1, f2 (with f1(0) = f1(1), f2(0) = f2(1)) such that

    f_1 ≤ χ_[a,b] ≤ f_2  and  ∫_0^1 (f_2(x) − f_1(x)) dx ≤ ε.

We then have that

    lim inf_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) ≥ lim inf_{n→∞} (1/n) Σ_{j=0}^{n−1} f_1(x_j) = ∫_0^1 f_1(x) dx
        ≥ ∫_0^1 f_2(x) dx − ε ≥ ∫_0^1 χ_[a,b](x) dx − ε

and

    lim sup_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) ≤ lim sup_{n→∞} (1/n) Σ_{j=0}^{n−1} f_2(x_j) = ∫_0^1 f_2(x) dx
        ≤ ∫_0^1 f_1(x) dx + ε ≤ ∫_0^1 χ_[a,b](x) dx + ε.

Since ε > 0 is arbitrary, we have shown that

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} χ_[a,b](x_j) = ∫_0^1 χ_[a,b](x) dx = b − a,

so that x_j is uniformly distributed mod 1.

2.4 Generalisation to Higher Dimensions

We shall now look at the distribution of sequences in R^k.

Definition 2.8. A sequence x_n = (x_n^{(1)}, ..., x_n^{(k)}) ∈ R^k is said to be uniformly distributed mod 1 if, for each choice of k intervals [a_1, b_1], ..., [a_k, b_k] ⊂ [0, 1), we have that

    (1/n) Σ_{j=0}^{n−1} Π_{i=1}^{k} χ_[a_i,b_i]({x_j^{(i)}}) → Π_{i=1}^{k} (b_i − a_i),  as n → ∞.

We have the following criterion for uniform distribution.

Theorem 2.9 (Multi-dimensional Weyl's Criterion). The sequence x_n ∈ R^k is uniformly distributed mod 1 if and only if

    (1/n) Σ_{j=0}^{n−1} e^{2πi(ℓ_1 x_j^{(1)} + ··· + ℓ_k x_j^{(k)})} → 0,  as n → ∞,

for all ℓ = (ℓ_1, ..., ℓ_k) ∈ Z^k \ {0}.


Remark 2.10. Here and throughout, 0 ∈ Z^k denotes the zero vector (0, ..., 0).

Proof. The proof is essentially the same as in the case k = 1.

We shall apply this result to the sequence x_n = (nα_1, ..., nα_k), for real numbers α_1, ..., α_k. Suppose first that the numbers α_1, ..., α_k, 1 are rationally independent. This means that if r_1, ..., r_k, r are rational numbers such that

    r_1α_1 + ··· + r_kα_k + r = 0,

then r_1 = ··· = r_k = r = 0. In particular, for ℓ = (ℓ_1, ..., ℓ_k) ∈ Z^k \ {0} and n ∈ N,

    ℓ_1 nα_1 + ··· + ℓ_k nα_k ∉ Z,

so that e^{2πi(ℓ_1 nα_1 + ··· + ℓ_k nα_k)} ≠ 1. We therefore have that

    |(1/n) Σ_{j=0}^{n−1} e^{2πi(ℓ_1 jα_1 + ··· + ℓ_k jα_k)}| = (1/n) |(e^{2πin(ℓ_1α_1 + ··· + ℓ_kα_k)} − 1)/(e^{2πi(ℓ_1α_1 + ··· + ℓ_kα_k)} − 1)|
        ≤ (1/n) · 2/|e^{2πi(ℓ_1α_1 + ··· + ℓ_kα_k)} − 1| → 0,  as n → ∞.

Therefore, by Weyl’s Criterion, (nα1, . . . , nαk ) is uniformly distributed mod 1. Now suppose that the numbers α1, . . . , αk , 1 are rationally dependent, i.e. there exist rational numbers r1, . . . , rk , r, not all equal to zero, such that r1α1 +···+rk αk +r = 0. Then k there exists ` = (`1, . . . , `k ) ∈ Z \{0} such that

    ℓ_1α_1 + ··· + ℓ_kα_k ∈ Z.

Thus e^{2πi(ℓ_1 nα_1 + ··· + ℓ_k nα_k)} = 1 for all n ∈ N and so

    (1/n) Σ_{j=0}^{n−1} e^{2πi(ℓ_1 jα_1 + ··· + ℓ_k jα_k)} = 1, which does not converge to 0 as n → ∞.

Therefore, (nα_1, ..., nα_k) is not uniformly distributed mod 1.
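Both cases can be illustrated numerically (a sketch): for α_1 = √2, α_2 = √3 the numbers α_1, α_2, 1 are rationally independent and the Weyl sums decay, while for α_1 = √2, α_2 = √2 + 1/2 the choice ℓ = (2, −2) gives ℓ_1α_1 + ℓ_2α_2 = −1 ∈ Z, so the averaged sums stay at 1.

```python
import cmath
import math

def weyl_average(alphas, ell, n):
    """|1/n sum_{j<n} exp(2 pi i j (l_1 a_1 + ... + l_k a_k))|."""
    total = sum(cmath.exp(2j * math.pi * j * sum(l * a for l, a in zip(ell, alphas)))
                for j in range(n))
    return abs(total) / n

n = 20000
independent = weyl_average((math.sqrt(2), math.sqrt(3)), (1, 1), n)
dependent = weyl_average((math.sqrt(2), math.sqrt(2) + 0.5), (2, -2), n)
print(independent, dependent)  # small versus (essentially) 1
```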

2.5 Generalisation to polynomials

We shall now consider another generalisation of the sequence nα. Write

    p(n) = α_k n^k + α_{k−1} n^{k−1} + ··· + α_1 n + α_0.

Theorem 2.11 (Weyl). If any one of α_1, ..., α_k is irrational then p(n) is uniformly distributed mod 1.


(Note that it is irrelevant whether or not α_0 is irrational.) To prove this theorem we shall need the following technical result.

Lemma 2.12 (van der Corput's Inequality). Let z_0, ..., z_{n−1} ∈ C and let 1 ≤ m ≤ n − 1. Then

    m^2 |Σ_{j=0}^{n−1} z_j|^2 ≤ m(n + m − 1) Σ_{j=0}^{n−1} |z_j|^2 + 2(n + m − 1) Re Σ_{j=1}^{m−1} (m − j) Σ_{i=0}^{n−1−j} z_{i+j} z̄_i.

Proof. Consider the following sums:

    S_1 = z_0
    S_2 = z_0 + z_1
    ⋮
    S_m = z_0 + z_1 + ··· + z_{m−1}
    S_{m+1} = z_1 + z_2 + ··· + z_m
    ⋮
    S_n = z_{n−m} + z_{n−m+1} + ··· + z_{n−1}
    S_{n+1} = z_{n−m+1} + z_{n−m+2} + ··· + z_{n−1}
    ⋮
    S_{n+m−2} = z_{n−2} + z_{n−1}
    S_{n+m−1} = z_{n−1}.

Notice that each z_j occurs in exactly m of the sums S_k. Thus

    S_1 + ··· + S_{n+m−1} = m Σ_{j=0}^{n−1} z_j

and so

    m^2 |Σ_{j=0}^{n−1} z_j|^2 = |S_1 + ··· + S_{n+m−1}|^2 ≤ (|S_1| + ··· + |S_{n+m−1}|)^2 ≤ (n + m − 1)(|S_1|^2 + ··· + |S_{n+m−1}|^2),

using the fact that

    (Σ_{k=1}^{l} a_k)^2 ≤ l Σ_{k=1}^{l} a_k^2.


Now, using the formula

    |Σ_{k=1}^{l} a_k|^2 = Σ_{k=1}^{l} |a_k|^2 + 2 Re Σ_{i<j} a_i ā_j,

we obtain

    |S_1|^2 + ··· + |S_{n+m−1}|^2 = m Σ_{j=0}^{n−1} |z_j|^2 + 2 Re Σ_{r=1}^{m−1} (m − r) Σ_{j=0}^{n−r−1} z_{j+r} z̄_j.

Hence

    m^2 |Σ_{j=0}^{n−1} z_j|^2 ≤ m(n + m − 1) Σ_{j=0}^{n−1} |z_j|^2 + 2(n + m − 1) Re Σ_{j=1}^{m−1} (m − j) Σ_{i=0}^{n−1−j} z_{i+j} z̄_i,

as required.
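Since the inequality holds for arbitrary complex numbers, it can be sanity-checked on random data (a sketch; the check exercises the statement, not the proof):

```python
import cmath
import random

random.seed(0)
n, m = 50, 7  # arbitrary choices with 1 <= m <= n - 1
z = [cmath.exp(2j * cmath.pi * random.random()) for _ in range(n)]

lhs = m ** 2 * abs(sum(z)) ** 2
# the correlation term: sum over j of (m - j) sum_i z_{i+j} conj(z_i)
correlations = sum((m - j) * sum(z[i + j] * z[i].conjugate() for i in range(n - j))
                   for j in range(1, m))
rhs = m * (n + m - 1) * sum(abs(w) ** 2 for w in z) + 2 * (n + m - 1) * correlations.real
print(lhs <= rhs)  # True, as the lemma guarantees
```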

Let x_n ∈ R be a sequence. For each m ≥ 1 define the sequence x_n^{(m)} = x_{n+m} − x_n of mth differences. The following lemma allows us to infer the uniform distribution of the sequence x_n if we know the uniform distribution of each of the mth differences of x_n.

Lemma 2.13. Let x_n ∈ R be a sequence. Suppose that for each m ≥ 1 the sequence x_n^{(m)} of mth differences is uniformly distributed mod 1. Then x_n is uniformly distributed mod 1.

Proof. We shall apply Weyl’s Criterion. We need to show that if ` ∈ Z \{0} then

    (1/n) Σ_{j=0}^{n−1} e^{2πiℓx_j} → 0,  as n → ∞.

Let z_j = e^{2πiℓx_j} for j = 0, ..., n − 1. Note that |z_j| = 1. Let 1 < m < n. By van der Corput's inequality,

    (m^2/n^2) |Σ_{j=0}^{n−1} e^{2πiℓx_j}|^2 ≤ (m/n^2)(n + m − 1)n + (2(n + m − 1)/n) Σ_{j=1}^{m−1} ((m − j)/n) Re Σ_{i=0}^{n−1−j} e^{2πiℓ(x_{i+j} − x_i)}
        = (m/n)(m + n − 1) + (2(n + m − 1)/n) Σ_{j=1}^{m−1} (m − j) Re A_{n,j}

where

    A_{n,j} = (1/n) Σ_{i=0}^{n−1−j} e^{2πiℓ(x_{i+j} − x_i)} = (1/n) Σ_{i=0}^{n−1−j} e^{2πiℓx_i^{(j)}}.


As the sequence x_i^{(j)} of jth differences is uniformly distributed mod 1, by Weyl's criterion we have that A_{n,j} → 0 for each j = 1, ..., m − 1. Hence for each m ≥ 1

    lim sup_{n→∞} (m^2/n^2) |Σ_{j=0}^{n−1} e^{2πiℓx_j}|^2 ≤ lim sup_{n→∞} (m/n)(n + m − 1) = m.

Hence, for each m > 1 we have

    lim sup_{n→∞} |(1/n) Σ_{j=0}^{n−1} e^{2πiℓx_j}| ≤ 1/√m.

As m > 1 is arbitrary, the result follows.

Proof of Weyl's Theorem. We will only prove Weyl's theorem in the special case where the

leading coefficient α_k of

    p(n) = α_k n^k + ··· + α_1 n + α_0

is irrational. (The general case, where α_i is irrational for some 1 ≤ i ≤ k, can be deduced very easily from this special case, but we will not go into it.) We shall use induction on the degree of p. Let ∆(k) denote the statement 'for every polynomial q of degree ≤ k, with irrational leading coefficient, the sequence q(n) is uniformly distributed mod 1'. We know that ∆(1) is true.

Suppose that ∆(k − 1) is true. Let p(n) = α_k n^k + ··· + α_1 n + α_0 be an arbitrary polynomial of degree k with α_k irrational. For each m ∈ N, we have that

    p(n + m) − p(n)
        = α_k (n + m)^k + α_{k−1} (n + m)^{k−1} + ··· + α_1 (n + m) + α_0
          − α_k n^k − α_{k−1} n^{k−1} − ··· − α_1 n − α_0
        = (α_k n^k + α_k k n^{k−1} m + ···) + (α_{k−1} n^{k−1} + α_{k−1} (k − 1) n^{k−2} m + ···)
          + ··· + (α_1 n + α_1 m) + α_0 − α_k n^k − α_{k−1} n^{k−1} − ··· − α_1 n − α_0.

After cancellation, we see that, for each m, p(n + m) − p(n) is a polynomial of degree k − 1 with irrational leading coefficient α_k k m. Therefore, by the inductive hypothesis, p(n + m) − p(n) is uniformly distributed mod 1. We may now apply Lemma 2.13 to conclude that p(n) is uniformly distributed mod 1, and so ∆(k) holds. This completes the induction.
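A quick numerical sketch of the theorem for p(n) = √2 n² (an arbitrary choice of polynomial): the fractional parts {p(n)} fall in [0, 1/2) about half the time.

```python
import math

n = 20000
# count how often the fractional part of sqrt(2) * j^2 lands in [0, 1/2)
hits = sum(1 for j in range(n) if (math.sqrt(2) * j * j) % 1.0 < 0.5)
print(hits / n)  # close to 1/2
```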

Exercise 2.14. Let p(n) = α_k n^k + α_{k−1} n^{k−1} + ··· + α_1 n + α_0 and q(n) = β_k n^k + β_{k−1} n^{k−1} + ··· + β_1 n + β_0. Show that (p(n), q(n)) is uniformly distributed mod 1 if at least one of (α_k, β_k, 1), ..., (α_1, β_1, 1) is rationally independent.


3 Examples of Dynamical Systems

3.1 The circle

Several of the key examples in the course take place on the circle. There are two different, although equivalent, ways of thinking about the circle. We can think of the circle as the quotient R/Z = {x + Z | x ∈ R}, which is easily seen to be equivalent to [0, 1) mod 1. We refer to this as additive notation. Alternatively, we can regard the circle as

    S^1 = {z ∈ C | |z| = 1} = {exp 2πiθ | θ ∈ [0, 1)}.

We refer to this as multiplicative notation. The two viewpoints are obviously equivalent, and we shall use whichever is most convenient in the circumstances.

We will also be interested in maps of the k-dimensional torus. The k-dimensional torus is defined to be

    R^k/Z^k = {x + Z^k | x ∈ R^k} = [0, 1)^k mod 1

(in additive notation) and

    S^1 × ··· × S^1 (k times) = {(exp 2πiθ_1, ..., exp 2πiθ_k) | θ_1, ..., θ_k ∈ [0, 1)}

(in multiplicative notation).

3.2 Rotations on a circle

Fix α ∈ [0, 1) and define the map T : R/Z → R/Z : x ↦ x + α mod 1. (In multiplicative notation this is exp 2πiθ ↦ exp 2πi(θ + α).) This map acts on the circle by rotating it by angle α. Clearly, we have that T^n(0) = nα mod 1 = {nα}, i.e. the fractional parts we considered in Section 2 form the orbit of 0.

Suppose that α = p/q is rational (here p, q ∈ Z, q ≠ 0). Then T^q(x) = x + qp/q mod 1 = x + p mod 1 = x. Hence every point of R/Z is periodic. When α is irrational, one can show that every point x ∈ R/Z has a dense orbit. This can be deduced from uniform distribution, but it can also be proved directly.

Exercise 3.1. Prove that, for an irrational rotation of the circle, every orbit is dense. (Recall that the orbit of x is dense if: for all y ∈ R/Z and for all ε > 0, there exists n > 0 such that d(T^n(x), y) < ε.) (Hints: (1) First show that T^n(x) = T^n(0) + x and conclude that it is sufficient to prove that the orbit of 0 is dense. (2) Prove that T^n(x) ≠ T^m(x) for n ≠ m. (3) Show that for each ε > 0 there exists n > 0 such that 0 < nα mod 1 < ε (you will need to remember that the circle is sequentially compact). (4) Now show that the orbit of 0 is dense.)
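The density asserted in Exercise 3.1 can be observed numerically (a sketch with floating-point arithmetic; the target y and tolerance ε are arbitrary choices):

```python
import math

alpha = math.sqrt(2) % 1.0  # an irrational rotation angle
y, eps = 0.7, 1e-3

def circle_dist(u, v):
    """Distance on R/Z between points of [0, 1)."""
    return min(abs(u - v), 1.0 - abs(u - v))

x = 0.0
steps = 0
for steps in range(1, 100001):
    x = (x + alpha) % 1.0
    if circle_dist(x, y) < eps:
        break
print(steps)  # the orbit of 0 enters the eps-neighbourhood of y after finitely many steps
```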


3.3 The doubling map

We have already seen the doubling map

    T : R/Z → R/Z : x ↦ 2x mod 1.

(In multiplicative notation this is

    T(exp 2πiθ) = exp 2πi(2θ),

or, writing z = e^{2πiθ}, T(z) = z^2.)

Proposition 3.2. Let T be the doubling map.

(i) There are 2^n − 1 points of period n.

(ii) The periodic points are dense.

(iii) There exists a dense orbit.
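Part (i) is easy to check by brute force with exact rational arithmetic (a numerical sketch, separate from the proof below): every point p/(2^n − 1) is fixed by T^n.

```python
from fractions import Fraction

def T_iter(x, n):
    """Apply the doubling map n times."""
    for _ in range(n):
        x = (2 * x) % 1
    return x

n = 5
candidates = [Fraction(p, 2 ** n - 1) for p in range(2 ** n - 1)]
assert all(T_iter(x, n) == x for x in candidates)
print(len(candidates))  # 2^5 - 1 = 31 points of period 5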

Proof. We prove (i). Notice that T^n(x) = 2^n x mod 1 = x if and only if there exists an integer p ≥ 0 such that

    2^n x = x + p.

Hence

    x = p/(2^n − 1).

We get distinct values of x ∈ [0, 1) for p = 0, 1, ..., 2^n − 2. Hence there are 2^n − 1 periodic points. We leave (ii) as an exercise.

Exercise 3.3. Prove (ii).

We sketch the proof of (iii). Let us denote the interval [0, 1/2) by the symbol 0 and

denote the interval [1/2, 1) by 1. Let x ∈ [0, 1). For each n ≥ 0 let x_n denote the symbol corresponding to the interval in which T^n(x) lies. Thus to each x ∈ [0, 1) we associate a sequence (x_0, x_1, ...) of 0s and 1s. It is easy to see that

    x = Σ_{n=0}^{∞} x_n/2^{n+1},

so that the sequence (x_0, x_1, ...) corresponds to the base 2 expansion of x. Notice that if x has coding (x_0, x_1, ...) then

    T(x) = 2x mod 1 = Σ_{n=0}^{∞} 2x_n/2^{n+1} mod 1 = x_0 + Σ_{n=0}^{∞} x_{n+1}/2^{n+1} mod 1 = Σ_{n=0}^{∞} x_{n+1}/2^{n+1},


so that T(x) has expansion (x_1, x_2, ...), i.e. T can be thought of as acting on the coding of x by shifting the associated sequence one place to the left.

For each n-tuple x_0, x_1, ..., x_{n−1} let

    I(x_0, ..., x_{n−1}) = {x ∈ [0, 1) | T^k(x) lies in interval x_k for k = 0, 1, ..., n − 1}.

That is, I(x_0, ..., x_{n−1}) is the set of all x ∈ [0, 1) whose base 2 expansion starts x_0, ..., x_{n−1}. We call I(x_0, ..., x_{n−1}) a cylinder of rank n.

Exercise 3.4. Draw all cylinders of rank ≤ 4.

One can show:

(i) a cylinder of rank n is an interval of length 2^{−n}.

(ii) for each x ∈ [0, 1) with base 2 expansion x_0, x_1, ..., the intervals I(x_0, ..., x_n) 'converge' as n → ∞ (in an appropriate sense) to x.

From these observations it is easy to see that, in order to construct a dense orbit, it is sufficient to construct a point x such that for every cylinder I there exists n = n(I) such that T n(x) ∈ I. To do this, firstly write down all possible cylinders (there are countably many):

0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, 0001,....

Now take x to be the point with base 2 expansion

010001101100000101001110010111011100000001 ...

(that is, just adjoin the symbolic representations of all cylinders in some order). One can easily check that such a point x has a dense orbit.

Exercise 3.5. Write down the proof of Proposition 3.2(iii), adding in complete details.

Remark 3.6. This technique of coding the orbits of a given dynamical system by partitioning the space X and forming an itinerary map is a very powerful technique that can be used to study many different classes of dynamical system.
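The concatenation argument can be made concrete (a sketch): adjoin all 0-1 words of length at most 5 and check that every word of length 4 occurs in the resulting expansion, so the corresponding orbit meets every rank-4 cylinder.

```python
from itertools import product

# Concatenate all 0-1 words of length 1, 2, ..., 5 in lexicographic order.
expansion = ''.join(''.join(w) for k in range(1, 6) for w in product('01', repeat=k))

# Every length-4 word occurs as a whole block, so every rank-4 cylinder is visited.
all_found = all(''.join(w) in expansion for w in product('01', repeat=4))
print(all_found, expansion[:14])
```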

3.4 Shifts of finite type

Let S = {1, 2, ..., k} be a finite set of symbols. We will be interested in sets consisting of sequences of these symbols, subject to certain conditions. We will impose the following conditions: we assume that for each symbol i we allow certain symbols (depending only on i) to follow i, and disallow all other symbols. This information is best recorded in a k × k matrix A with entries in {0, 1}. That is, we allow the symbol j to follow the symbol i if and only if the corresponding (i, j)th entry of the matrix A (denoted by A_{i,j}) is equal to 1.


Definition 3.7. Let A be a k × k matrix with entries in {0, 1}. Let

    Σ_A^+ = {(x_j)_{j=0}^{∞} | A_{x_j,x_{j+1}} = 1 for all j ∈ Z^+}

denote the set of all infinite sequences of symbols (x_j) where symbol j can follow symbol i precisely when A_{i,j} = 1. We call Σ_A^+ a (one-sided) shift of finite type. Let

    Σ_A = {(x_j)_{j=−∞}^{∞} | A_{x_j,x_{j+1}} = 1 for all j ∈ Z}

denote the set of all bi-infinite sequences of symbols subject to the same conditions. We call

Σ_A a (two-sided) shift of finite type.

Sometimes for brevity we refer to Σ_A^+ or Σ_A as a shift space.

An alternative description of Σ_A^+ and Σ_A can be given as follows. Consider a directed graph with vertex set {1, 2, ..., k} and with a directed edge from vertex i to vertex j precisely when A_{i,j} = 1. Then Σ_A^+ and Σ_A correspond to the sets of all infinite (respectively, bi-infinite) paths in this graph.

Define

    σ^+ : Σ_A^+ → Σ_A^+  by  (σ^+(x))_j = x_{j+1}.

Then σ^+ takes a sequence in Σ_A^+ and shifts it one place to the left (deleting the first term). We call σ^+ the (one-sided, left) shift map. There is a corresponding shift map on the two-sided shift space. Define

    σ : Σ_A → Σ_A by

    (σ(x))_j = x_{j+1},

so that σ shifts sequences one place to the left. Notice that in this case we do not need to delete any terms of the sequence. We call σ the (two-sided, left) shift map. Notice that σ is invertible but σ^+ is not. For ease of notation, we shall often write σ to denote both the one-sided and the two-sided shift map.

Examples 3.8. Take A to be the k × k matrix with every entry equal to 1. Then any symbol can follow any other symbol. Hence Σ_A^+ is the space of all sequences of symbols from {1, 2, ..., k}. In this case we write Σ_k^+ for Σ_A^+ and refer to it as the full one-sided k-shift. Similarly, we can define the full two-sided k-shift.

Take A to be the matrix

    ( 1 1 )
    ( 1 0 ).

Then Σ_A^+ consists of all sequences of 1s and 2s subject to the condition that each 2 must be followed by a 1.


The following two exercises show that, for certain A, Σ_A^+ (or Σ_A) can be rather uninteresting.

Exercise 3.9. Let

    A = ( 0 1 )
        ( 0 0 ).

Show that Σ_A^+ is empty.

Exercise 3.10. Let

    A = ( 1 1 )
        ( 0 1 ).

Calculate Σ_A^+.

The following conditions on A guarantee that Σ_A^+ (or Σ_A) is more interesting than the examples in Exercises 3.9 and 3.10.

Definition 3.11. Let A be a k × k matrix with entries in {0, 1}. We say that A is irreducible if for each i, j ∈ {1, 2, ..., k} there exists n = n(i, j) > 0 such that (A^n)_{i,j} > 0. (Here, (A^n)_{i,j} denotes the (i, j)th entry of the nth power of A.)

Definition 3.12. Let A be a k × k matrix with entries in {0, 1}. We say that A is aperiodic if there exists n > 0 such that for all i, j ∈ {1, 2, ..., k} we have (A^n)_{i,j} > 0.

In graph-theoretic terms, the matrix A is irreducible if there exists a path along edges from any vertex to any other vertex. The matrix A is aperiodic if this path can be chosen to have the same length (i.e. consist of the same number of edges), irrespective of the two vertices chosen.

Exercise 3.13. (i) Consider the matrix

    ( 1 1 )
    ( 1 0 ).

Draw the corresponding directed graph. Is this matrix irreducible? Is it aperiodic?

(ii) Consider the matrix

    ( 0 1 0 1 )
    ( 1 0 1 0 )
    ( 0 1 0 1 )
    ( 1 0 1 0 ).

Draw the corresponding directed graph. Is this matrix irreducible? Is it aperiodic?

Remark 3.14. These shift spaces may seem very strange at first sight: it takes a long time to get used to them. However (as we shall see) they are particularly tractable examples of chaotic dynamical systems. Moreover, a wide class of dynamical systems (notably hyperbolic dynamical systems) can be modelled in terms of shifts of finite type. We have already seen a particularly simple example of this: the doubling map can be modelled by the full one-sided 2-shift.
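Both conditions can be checked mechanically for small matrices (a sketch that settles Exercise 3.13 numerically; for a k × k matrix, paths of length at most k suffice for irreducibility):

```python
def mat_mul(A, B):
    k = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

def is_irreducible(A):
    # Some power A^n (1 <= n <= k) must connect each pair (i, j):
    # accumulate S = A + A^2 + ... and check every entry is positive.
    k = len(A)
    P = A
    S = [row[:] for row in A]
    for _ in range(k - 1):
        P = mat_mul(P, A)
        S = [[S[i][j] + P[i][j] for j in range(k)] for i in range(k)]
    return all(e > 0 for row in S for e in row)

def is_aperiodic(A, max_power=64):
    # A single power A^n must have all entries positive.
    P = A
    for _ in range(max_power):
        if all(e > 0 for row in P for e in row):
            return True
        P = mat_mul(P, A)
    return False

M1 = [[1, 1], [1, 0]]
M2 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
print(is_irreducible(M1), is_aperiodic(M1))  # True True
print(is_irreducible(M2), is_aperiodic(M2))  # True False: powers of M2 alternate
```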


3.5 Periodic points

A sequence x = (x_j)_{j=0}^{∞} ∈ Σ_A^+ is periodic for the shift σ if there exists n > 0 such that σ^n x = x. One can easily check that this means that

    x_j = x_{j+n} for all j ∈ Z^+.

That is, the sequence x is determined by a finite block of symbols x_0, ..., x_{n−1}, and

    x = (x_0, x_1, ..., x_{n−1}, x_0, x_1, ..., x_{n−1}, ...).

Exercise 3.15. Consider the full one-sided k-shift. How many periodic points of period n are there?

3.6 Cylinders

Later on we will need a particularly tractable class of subsets of shift spaces. These are the cylinder sets, formed by fixing a finite set of coordinates. More precisely, in Σ_A we define

    [y_{−m}, ..., y_{−1}, y_0, y_1, ..., y_n]_{−m,n} = {x ∈ Σ_A | x_j = y_j, −m ≤ j ≤ n},

and in Σ_A^+ we define

    [y_0, y_1, ..., y_n]_{0,n} = {x ∈ Σ_A^+ | x_j = y_j, 0 ≤ j ≤ n}.

3.7 A metric on Σ_A^+

What does it mean for two sequences in Σ_A^+ to be 'close'? Heuristically we will say that two sequences (x_j)_{j=0}^{∞} and (y_j)_{j=0}^{∞} are close if they agree in a large number of initial places. More formally, for two sequences x = (x_j)_{j=0}^{∞}, y = (y_j)_{j=0}^{∞} ∈ Σ_A^+ we define n(x, y) by setting n(x, y) = n if x_j = y_j for j = 0, ..., n − 1 but x_n ≠ y_n. Thus n(x, y) is the first place in which the sequences x and y disagree. (We set n(x, y) = ∞ if x = y.) We define a metric d on Σ_A^+ by

    d((x_j)_{j=0}^{∞}, (y_j)_{j=0}^{∞}) = (1/2)^{n(x,y)}  if x ≠ y

and d((x_j)_{j=0}^{∞}, (y_j)_{j=0}^{∞}) = 0 if x = y.

Exercise 3.16. Show that this is a metric.

In the two-sided case, we can define a metric in a similar way. Let x = (x_j)_{j=−∞}^{∞}, y = (y_j)_{j=−∞}^{∞} ∈ Σ_A. Define n(x, y) by setting n(x, y) = n if x_j = y_j for |j| ≤ n − 1 and either x_n ≠ y_n or x_{−n} ≠ y_{−n}. Thus n(x, y) is the first place, going either forwards or backwards, in which the sequences x, y disagree. (We again set n(x, y) = ∞ if x = y.)


We define a metric d on Σ_A in the same way:

    d((x_j)_{j=−∞}^{∞}, (y_j)_{j=−∞}^{∞}) = (1/2)^{n(x,y)}  if x ≠ y

and d((x_j)_{j=−∞}^{∞}, (y_j)_{j=−∞}^{∞}) = 0 if x = y.
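The one-sided metric is immediate to implement on finite truncations of sequences (a sketch; equal-length tuples stand in for infinite sequences):

```python
def shift_metric(x, y):
    """d(x, y) = (1/2)^{n(x, y)} on truncated one-sided sequences of equal length."""
    for n, (a, b) in enumerate(zip(x, y)):
        if a != b:           # first place of disagreement is n(x, y)
            return 0.5 ** n
    return 0.0               # the truncations agree everywhere

print(shift_metric((0, 1, 1, 0), (0, 1, 0, 0)))  # first disagreement at place 2: 1/4
print(shift_metric((1, 1), (1, 1)))              # 0.0
```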

Theorem 3.17. Let Σ_A^+ be a shift of finite type.

(i) Σ_A^+ is a compact metric space.

(ii) The shift map σ is continuous.

Remark 3.18. The corresponding statements for the two-sided case also hold.

Proof. (i) If Σ_A^+ = ∅ or if Σ_A^+ is finite then trivially it is compact. Thus we may assume that Σ_A^+ is infinite.

Let x^(m) ∈ Σ_A^+ be a sequence (in reality, a sequence of sequences!). We need to show that x^(m) has a convergent subsequence. Since Σ_A^+ = ∪_{i=1}^{k} [i], at least one cylinder [i] contains infinitely many elements of the sequence x^(m); call it [y_0]. Thus there are infinitely many m for which x^(m) ∈ [y_0].

Since [y_0] = ∪_{A_{y_0,i}=1} [y_0 i], we similarly obtain a cylinder of rank 2, [y_0 y_1] say, containing infinitely many elements of the sequence x^(m).

Continue inductively in this way to obtain a nested family of cylinders [y_0, ..., y_n], n ≥ 0, each containing infinitely many elements of the sequence x^(m).

Set y = (y_n)_{n=0}^{∞} ∈ Σ_A^+. Then for each n ≥ 0, there exist infinitely many m for which d(y, x^(m)) ≤ (1/2)^n. Thus y is the limit of some subsequence of x^(m).

(ii) We want to show the following: for all ε > 0 there exists δ > 0 such that d(x, y) < δ ⇒ d(σ(x), σ(y)) < ε. Let ε > 0. Choose n such that 1/2^n < ε. Let δ = 1/2^{n+1}. Suppose that d(x, y) < δ. Then n(x, y) > n + 1, so that x and y agree in the first n + 1 places. Hence σ(x) and σ(y) agree in the first n places, so that n(σ(x), σ(y)) > n. Hence d(σ(x), σ(y)) = 1/2^{n(σ(x),σ(y))} < 1/2^n < ε.

Exercise 3.19. Let A be an irreducible k × k matrix with entries in {0, 1}. Show that the set of all periodic points for σ is dense in Σ_A^+. (Recall that a subset Y of a set X is said to be dense if: for all x ∈ X and for all ε > 0 there exists y ∈ Y such that d(x, y) < ε, i.e. any point of X can be arbitrarily well approximated by a point of Y.)

Exercise 3.20. Let A be an irreducible k × k matrix with entries in {0, 1}. Show that there exists a point x ∈ Σ_A^+ with a dense orbit. (Hint: first show that if the orbit of a point visits each cylinder then it is dense. To construct such a point, mimic the argument used for the doubling map above. Use irreducibility to show that one can concatenate cylinders together by inserting finite strings of symbols between them.)


3.8 The continued fraction map

Every x ∈ (0, 1) can be expressed as a continued fraction:

    x = 1/(x_0 + 1/(x_1 + 1/(x_2 + 1/(x_3 + ···))))    (1)

for x_n ∈ N. For example,

    (−1 + √5)/2 = 1/(1 + 1/(1 + 1/(1 + 1/(1 + ···)))),

    3/4 = 1/(1 + 1/3),

    π = 3 + 1/(7 + 1/(15 + 1/(1 + 1/(292 + ···)))).

One can show that rational numbers have a finite continued fraction expansion (that

is, the above expression terminates at x_n for some n). Conversely, it is clear that a finite continued fraction expansion gives rise to a rational number. Thus each irrational x ∈ (0, 1) has an infinite continued fraction expansion of the form (1). Moreover, one can show that this expansion is unique. For brevity, we will sometimes write (1) as x = [x_0; x_1; x_2; ...].

Recall that earlier in this section we saw how the doubling map x ↦ 2x mod 1 can be used to determine the base 2 expansion of x. Here we introduce a dynamical system that allows us to determine the continued fraction expansion of x.

We can read off the numbers xi from the transformation T : [0, 1] → [0, 1] defined by T (0) = 0 and, for 0 < x < 1, 1 T (x) = mod 1. x Then 1  1   1  x = , x = , . . . , x = . 0 x 1 T x n T nx This is called the continued fraction map or the Gauss map. Exercise 3.21. Draw the graph of the continued fraction map. Later in the course we will study the ergodic theoretic properties of the continued fraction map and use them to deduce some interesting facts about continued fractions
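As a quick illustration (a sketch of ours, not part of the notes), one can iterate the Gauss map in floating point to read off the first few continued fraction digits. For (−1 + √5)/2 every digit equals 1; floating-point error grows with each iteration, so only the first few digits are reliable:

```python
import math

def gauss_map(x):
    """T(x) = 1/x mod 1 (the continued fraction / Gauss map), with T(0) = 0."""
    if x == 0:
        return 0.0
    return (1.0 / x) % 1.0

def cf_digits(x, n):
    """Read off the first n continued fraction digits: x_k = floor(1/T^k(x))."""
    digits = []
    for _ in range(n):
        if x == 0:
            break  # rational input: the expansion has terminated
        digits.append(math.floor(1.0 / x))
        x = gauss_map(x)
    return digits

# (-1 + sqrt(5))/2 = [1; 1; 1; ...]: every digit equals 1.
golden = (math.sqrt(5) - 1) / 2
print(cf_digits(golden, 5))
```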

3.9 Endomorphisms of a torus

Take X = R^k/Z^k to be the k-torus.


Let A = (aij) be a k × k matrix with entries in Z and with det A ≠ 0. Then A defines a linear map R^k → R^k by matrix multiplication, sending the column vector (x1, . . . , xk) to A(x1, . . . , xk).

For brevity, we shall often write this as (x1, . . . , xk) ↦ A(x1, . . . , xk). Since A is an integer matrix, it maps Z^k to itself. We claim that A allows us to define a map

    T = TA : R^k/Z^k → R^k/Z^k

(x1, . . . , xk ) 7→ A(x1, . . . , xk ) mod 1.

To see that this map is well defined, we need to check that if x, y ∈ R^k determine the same point in R^k/Z^k then Ax mod 1 and Ay mod 1 are the same point in R^k/Z^k. But this is clear: if x, y ∈ R^k give the same point in the torus, then x = y + n for some n ∈ Z^k. Hence Ax = A(y + n) = Ay + An. As A maps Z^k to itself, we see that An ∈ Z^k, so that Ax and Ay determine the same point in the torus.

Definition 3.22. Let A = (aij) denote a k × k matrix with integer entries such that det A ≠ 0. Then we call the map TA : R^k/Z^k → R^k/Z^k a linear toral endomorphism.

The map T is not invertible in general. However, if det A = ±1 then A^{-1} exists and is an integer matrix. Hence we have a map T^{-1} given by

    T^{-1}(x1, . . . , xk) = A^{-1}(x1, . . . , xk) mod 1.

One can easily check that T^{-1} is the inverse of T.

Definition 3.23. Let A = (aij) denote a k × k matrix with integer entries such that det A = ±1. Then we call the map TA : R^k/Z^k → R^k/Z^k a linear toral automorphism.

Example 3.24. Take A to be the matrix

    A = ( 2 1 )
        ( 1 1 )

and define T : R^2/Z^2 → R^2/Z^2 to be the induced map:

T (x1, x2) = (2x1 + x2 mod 1, x1 + x2 mod 1).

Then T is a linear toral automorphism and is called Arnold’s cat map. (CAT stands for ‘C’ontinuous ‘A’utomorphism of the ‘T’orus.)
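Because the cat map has integer matrix entries, its action on points with rational coordinates can be followed with exact rational arithmetic. A small sketch of ours (the helper names `cat_map` and `period` are not from the notes) exhibiting a periodic point:

```python
from fractions import Fraction

def cat_map(p):
    """Arnold's cat map T(x1, x2) = (2 x1 + x2 mod 1, x1 + x2 mod 1) on R^2/Z^2."""
    x1, x2 = p
    return ((2 * x1 + x2) % 1, (x1 + x2) % 1)

def period(p, max_iter=1000):
    """Least n >= 1 with T^n(p) = p, using exact rational arithmetic."""
    q = cat_map(p)
    n = 1
    while q != p and n < max_iter:
        q = cat_map(q)
        n += 1
    return n if q == p else None

p = (Fraction(1, 5), Fraction(2, 5))
print(period(p))   # T(1/5, 2/5) = (4/5, 3/5) and T(4/5, 3/5) = (1/5, 2/5): period 2
```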

Definition 3.25. Suppose that det A = ±1. Then we call T a hyperbolic toral automorphism if A has no eigenvalues of modulus 1.


Exercise 3.26. Check that Arnold’s cat map is hyperbolic. Decide whether the following matrices give hyperbolic toral automorphisms:

    A1 = ( 1 1 )      A2 = ( 1 1 )
         ( 0 1 ),          ( 1 0 ).

Let us consider the special case of a toral automorphism of the 2-dimensional torus R^2/Z^2.

Proposition 3.27. Let T be a hyperbolic toral automorphism of R^2/Z^2 with corresponding matrix A having eigenvalues λ1, λ2.

(i) The periodic points of T correspond precisely with the set of rational points of R^2/Z^2:

    { (p1/q, p2/q) + Z^2 | p1, p2, q ∈ N, 0 ≤ p1, p2 < q }.

(In particular, the periodic points are dense.)

(ii) Suppose that det A = 1. Then the number of points of period n is given by:

    card{x ∈ R^2/Z^2 | T^n(x) = x} = |λ1^n + λ2^n − 2|.

Proof. (i) If (x1, x2) = (p1/q, p2/q) has rational co-ordinates then we can write

    T^n(x1, x2) = (p1^(n)/q, p2^(n)/q),

where 0 ≤ p1^(n), p2^(n) < q are integers. As there are at most q^2 distinct possibilities for (p1^(n), p2^(n)), this sequence (in n) must be eventually periodic. Hence there exist n1 > n0 such that T^{n1}(x1, x2) = T^{n0}(x1, x2). As T is invertible, we see that T^{n1−n0}(x1, x2) = (x1, x2), so that (x1, x2) is periodic.

Conversely, if (x1, x2) ∈ R^2/Z^2 is periodic then T^n(x1, x2) = (x1, x2) for some n > 0. Hence

    A^n(x1, x2) = (x1, x2) + (n1, n2)    (2)

for some n1, n2 ∈ Z. As A is hyperbolic, A has no eigenvalues of modulus 1; hence A^n has no eigenvalues of modulus 1, and in particular 1 is not an eigenvalue of A^n. Hence A^n − I is invertible, and solutions to (2) have the form

    (x1, x2) = (A^n − I)^{-1}(n1, n2).

As A^n − I has entries in Z, the matrix (A^n − I)^{-1} has entries in Q. Hence x1, x2 ∈ Q.


(ii) A point (x1, x2) is periodic with period n for T if and only if

    (A^n − I)(x1, x2) = (n1, n2)    (3)

for some n1, n2 ∈ Z.

We may take x1, x2 ∈ [0, 1). Let u = (A^n − I)(0, 1) and v = (A^n − I)(1, 0). The map A^n − I maps [0, 1) × [0, 1) onto the parallelogram

R = {αu + βv | 0 ≤ α, β < 1}.

For the point (x1, x2) ∈ [0, 1) × [0, 1) to be periodic with period n, it follows from (3) that (A^n − I)(x1, x2) must be an integer point of R. Thus the number of periodic points of period n corresponds to the number of integer points in R. One can check that the number of such points is equal to the area of R. Hence the number of periodic points of period n is given by |det(A^n − I)|.

Let us calculate the eigenvalues of A^n − I. Let µ be an eigenvalue of A^n − I with eigenvector v. Then

    (A^n − I)v = µv ⇔ A^n v = (µ + 1)v,

so that µ + 1 is an eigenvalue of A^n. As the eigenvalues of A are λ1, λ2, the eigenvalues of A^n are λ1^n, λ2^n. Hence the eigenvalues of A^n − I are λ1^n − 1, λ2^n − 1. As the determinant of a matrix is the product of its eigenvalues, we have

    |det(A^n − I)| = |(λ1^n − 1)(λ2^n − 1)|
                   = |(λ1 λ2)^n + 1 − (λ1^n + λ2^n)|
                   = λ1^n + λ2^n − 2,

as λ1λ2 = det A = 1.
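The count |det(A^n − I)| = λ1^n + λ2^n − 2 can be checked directly for the cat map. In the sketch below (our own illustration, not from the notes), the power sums tn = λ1^n + λ2^n are computed by the integer recurrence tn = (tr A) t_{n−1} − (det A) t_{n−2} = 3 t_{n−1} − t_{n−2}, avoiding floating point entirely:

```python
# Check of Proposition 3.27(ii) for the cat map A = [[2, 1], [1, 1]]:
# since tr A = 3 and det A = 1, the power sums t_n = lambda1^n + lambda2^n obey
# t_0 = 2, t_1 = 3, t_n = 3 t_{n-1} - t_{n-2}, and the number of points of
# period n should be |det(A^n - I)| = t_n - 2.

def mat_mult(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(A, n):
    R = [[1, 0], [0, 1]]             # 2x2 identity
    for _ in range(n):
        R = mat_mult(R, A)
    return R

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

A = [[2, 1], [1, 1]]
t = [2, 3]                            # t[n] = lambda1^n + lambda2^n
for n in range(2, 8):
    t.append(3 * t[-1] - t[-2])

counts = []
for n in range(1, 8):
    An = mat_pow(A, n)
    AnI = [[An[0][0] - 1, An[0][1]], [An[1][0], An[1][1] - 1]]
    assert abs(det2(AnI)) == t[n] - 2   # the two computations agree
    counts.append(t[n] - 2)
print(counts)
```

The printed counts 1, 5, 16, 45, 121, 320, 841 match |det(A^n − I)| computed directly from integer matrix powers.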


4 Measure Theory

4.1 Background

In section 1 we remarked that ergodic theory is the study of the qualitative distributional properties of typical orbits of a dynamical system and that these properties are expressed in terms of measure theory. Measure theory therefore lies at the heart of ergodic theory. However, we will not need to know the (many!) intricacies of measure theory and this section will be devoted to an expository account of the required facts.

4.2 Measure spaces

Loosely speaking, a measure is a function that, when given a subset of a space X, will say how ‘big’ that subset is. A motivating example is given by Lebesgue measure. The Lebesgue measure of an interval is given by its length. In defining an abstract measure space, we will be taking the properties of ‘length’ (or, in higher dimensions, ‘volume’) and abstracting them, in much the same way that a metric space abstracts the properties of ‘distance’. It turns out that in general it is not possible to define the measure of an arbitrary subset of X. Instead, we will usually have to restrict our attention to a class of subsets of X.

Definition 4.1. A collection B of subsets of X is called a σ-algebra if the following properties hold:

(i) ∅ ∈ B,

(ii) if E ∈ B then its complement X \ E ∈ B,

(iii) if En ∈ B, n = 1, 2, 3, . . ., is a countable sequence of sets in B then their union ⋃_{n=1}^∞ En ∈ B.

Examples 4.2. The trivial σ-algebra is given by B = {∅, X}. The full σ-algebra is given by B = P(X), i.e. the collection of all subsets of X.

Here are some easy properties of σ-algebras:

Lemma 4.3. Let B be a σ-algebra of subsets of X. Then

(i) X ∈ B;

(ii) if En ∈ B then ⋂_{n=1}^∞ En ∈ B.

Exercise 4.4. Prove Lemma 4.3.

In the special case when X is a compact metric space there is a particularly important σ-algebra.


Definition 4.5. Let X be a compact metric space. We define the Borel σ-algebra B(X) to be the smallest σ-algebra of subsets of X which contains all the open subsets of X.

Remarks 4.6. By ‘smallest’ we mean that if C is another σ-algebra that contains all open subsets of X then B(X) ⊂ C. We say that the Borel σ-algebra is generated by the open sets, and we call a set in B(X) a Borel set. By Definition 4.1(ii), the Borel σ-algebra also contains all closed sets, and it is the smallest σ-algebra with this property.

Now let X be a set and let B be a σ-algebra of subsets of X.

Definition 4.7. A function µ : B → R^+ ∪ {∞} is called a measure if:

(i) µ(∅) = 0;

(ii) if En is a countable collection of pairwise disjoint sets in B (i.e. En ∩ Em = ∅ for n ≠ m) then

    µ(⋃_{n=1}^∞ En) = Σ_{n=1}^∞ µ(En).

If µ(X) < ∞ then we call µ a finite measure. We call (X, B, µ) a measure space. If µ(X) = 1 then we call µ a probability measure and refer to (X, B, µ) as a probability space.

Remark 4.8. Thus a measure just abstracts properties of ‘length’ or ‘volume’. Condition (i) says that the empty set has zero length, and condition (ii) says that the length of a disjoint union is the sum of the lengths of the individual sets.

Definition 4.9. We say that a property holds almost everywhere (a.e.) if the set of points on which the property fails to hold has measure zero.

We will usually be interested in studying measures on the Borel σ-algebra of a compact metric space X. To define such a measure, we need to define the measure of an arbitrary Borel set. In general, the Borel σ-algebra is extremely large. In the next section we see that it is often unnecessary to do this and instead it is sufficient to define the measure of a certain class of subsets.

4.3 The Kolmogorov Extension Theorem

A collection A of subsets of X is called an algebra if:

(i) ∅ ∈ A,

(ii) if A, B ∈ A then A ∩ B ∈ A;


(iii) if A ∈ A then its complement A^c ∈ A.

Thus an algebra is like a σ-algebra, except that we do not assume that A is closed under countable unions.

Example 4.10. Take X = [0, 1], and A = {all finite unions of subintervals}.

Let B(A) denote the σ-algebra generated by A, i.e., the smallest σ-algebra containing A. (In the above example B(A) is the Borel σ-algebra.)

Theorem 4.11 (Kolmogorov Extension Theorem). Let A be an algebra of subsets of X. Suppose that µ : A → R^+ satisfies:

(i) µ(∅) = 0;

(ii) there exist finitely or countably many sets Xn ∈ A such that X = ⋃_n Xn and µ(Xn) < ∞;

(iii) if En ∈ A, n ≥ 1, are pairwise disjoint and if ⋃_{n=1}^∞ En ∈ A then

    µ(⋃_{n=1}^∞ En) = Σ_{n=1}^∞ µ(En).

Then there is a unique measure µ : B(A) → R^+ which is an extension of µ : A → R^+.

Remarks 4.12. (i) The important hypotheses are (i) and (iii). Thus the Kolmogorov Extension Theorem says that if we have a function µ that looks like a measure on an algebra A, then it is indeed a measure when extended to B(A).

(ii) We will often use the Kolmogorov Extension Theorem as follows. Take X = [0, 1] and take A to be the algebra consisting of all finite unions of subintervals of X. We then define the ‘measure’ µ of a subinterval in such a way as to be consistent with the hypotheses of the Kolmogorov Extension Theorem. It then follows that µ does indeed define a measure on the Borel σ-algebra.

(iii) Here is another way in which we shall use the Kolmogorov Extension Theorem. Suppose we have two measures, µ and ν, and we want to see if µ = ν. A priori we would have to check that µ(B) = ν(B) for all B ∈ B. The Kolmogorov Extension Theorem says that it is sufficient to check that µ(E) = ν(E) for all E in an algebra A that generates B. For example, to show that two measures on [0, 1] are equal, it is sufficient to show that they give the same measure to each subinterval.


4.4 Examples of measure spaces

Lebesgue measure on [0, 1]. Take X = [0, 1] and take A to be the collection of all finite unions of subintervals of [0, 1]. For a subinterval [a, b] define

µ([a, b]) = b − a.

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure.

Lebesgue measure on R/Z. Take X = R/Z = [0, 1) mod 1 and take A to be the collection of all finite unions of subintervals of [0, 1). For a subinterval [a, b] define

µ([a, b]) = b − a.

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure on the circle.

Lebesgue measure on the k-dimensional torus. Take X = R^k/Z^k = [0, 1)^k mod 1 and take A to be the collection of all finite unions of k-dimensional sub-cubes ∏_{j=1}^k [aj, bj] of [0, 1)^k. For a sub-cube ∏_{j=1}^k [aj, bj] of [0, 1)^k, define

    µ(∏_{j=1}^k [aj, bj]) = ∏_{j=1}^k (bj − aj).

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure on the torus.

Stieltjes measures. Take X = [0, 1] and let ρ : [0, 1] → R^+ be an increasing function such that ρ(1) − ρ(0) = 1. Take A to be the algebra of finite unions of subintervals and define

µρ([a, b]) = ρ(b) − ρ(a).

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. We say that µρ is the Stieltjes measure on [0, 1] determined by ρ.

Dirac measures. Finally, we give an example of a class of measures that does not fall into the above categories. Let X be an arbitrary space and let B be an arbitrary σ-algebra of subsets of X. Let x ∈ X. Define the measure δx by

    δx(A) = 1 if x ∈ A,  and δx(A) = 0 if x ∉ A.

Then δx defines a probability measure. It is called the Dirac measure at x.
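As a toy illustration (ours, not from the notes), Definition 4.7 is easy to verify directly for a Dirac measure on a finite set, where every subset is measurable and only finite additivity is in play:

```python
# Toy illustration of Definition 4.7 for the Dirac measure delta_x on subsets
# of a finite set X (where the sigma-algebra is the full power set).

def dirac(x):
    """Return delta_x as a function on subsets of X."""
    return lambda A: 1 if x in A else 0

X = {0, 1, 2, 3, 4}
mu = dirac(2)

# mu(empty set) = 0, and mu is additive over a disjoint decomposition of X:
E = [{0, 1}, {2}, {3, 4}]                 # pairwise disjoint, union = X
assert mu(set()) == 0
assert mu(set().union(*E)) == sum(mu(e) for e in E)   # 1 = 0 + 1 + 0
print(mu({1, 2, 3}), mu({0, 4}))
```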

4.5 Integration: The Riemann integral

Before discussing the Lebesgue theory of integration, we briefly review the construction of the Riemann integral. This gives a method for defining the integral of (sufficiently nice) functions defined on [a, b]. In the next subsection we will see how the Lebesgue integral is a generalisation of the Riemann integral, in the sense that it allows us to integrate functions defined on spaces more general than subintervals of R (as well as a wider class of functions). Moreover, the Lebesgue integral has other nice properties, for example it is well-behaved with respect to limits. Here we give a brief exposition of some inadequacies of the Riemann integral and how they motivate the Lebesgue integral.

Let f : [a, b] → R be a bounded function (for the moment we impose no other conditions on f).

A partition ∆ of [a, b] is a finite set of points ∆ = {x0, x1, x2, . . . , xn} with

    a = x0 < x1 < x2 < ··· < xn = b.

In other words, we are dividing [a, b] up into subintervals. We then form the upper and lower Riemann sums

    U(f, ∆) = Σ_{i=0}^{n−1} sup_{x∈[xi,xi+1]} f(x) (xi+1 − xi),
    L(f, ∆) = Σ_{i=0}^{n−1} inf_{x∈[xi,xi+1]} f(x) (xi+1 − xi).

The idea is then that if we make the subintervals in the partition small, these sums will be a good approximation to (our intuitive notion of) the integral of f over [a, b]. More precisely, if

    inf_∆ U(f, ∆) = sup_∆ L(f, ∆),

where the infimum and supremum are taken over all possible partitions of [a, b], then we write

    ∫_a^b f(x) dx

for their common value and call it the (Riemann) integral of f between those limits. We also say that f is Riemann integrable.

The class of Riemann integrable functions includes continuous functions and step functions (i.e. finite linear combinations of characteristic functions of intervals). However, there are many functions for which one wishes to define an integral but which are not Riemann integrable, making the theory rather unsatisfactory. For example, define f : [0, 1] → R by

    f(x) = 1 if x ∈ Q,  and f(x) = 0 otherwise.

Since between any two distinct real numbers we can find both a rational number and an irrational number, given 0 ≤ y < z ≤ 1 we can find y < x < z with f(x) = 1 and y < x′ < z with f(x′) = 0. Hence for any partition ∆ = {x0, x1, . . . , xn} of [0, 1], we have

    U(f, ∆) = Σ_{i=0}^{n−1} (xi+1 − xi) = 1,
    L(f, ∆) = 0.


Taking the infimum and supremum, respectively, over all partitions ∆ shows that f is not Riemann integrable.

Why does Riemann integration not work for the above function, and how could we go about improving it? Let us look again at (and slightly rewrite) the formulæ for U(f, ∆) and L(f, ∆). We have

    U(f, ∆) = Σ_{i=0}^{n−1} sup_{x∈[xi,xi+1]} f(x) · l([xi, xi+1])

and

    L(f, ∆) = Σ_{i=0}^{n−1} inf_{x∈[xi,xi+1]} f(x) · l([xi, xi+1]),

where, for an interval [y, z], l([y, z]) = z − y denotes its length. In the example above, things didn’t work because dividing [0, 1] into intervals (no matter how small) did not ‘separate out’ the different values that f could take. But suppose we had a notion of ‘length’ that worked for more general sets than intervals. Then we could do better by considering more complicated ‘partitions’ of [0, 1], where by

partition we now mean a collection of subsets {E1, . . . , Em} of [0, 1] such that Ei ∩ Ej = ∅ if i ≠ j, and ⋃_{i=1}^m Ei = [0, 1]. In the example, for instance, it might be reasonable to write

    ∫_0^1 f(x) dx = 1 × l([0, 1] ∩ Q) + 0 × l([0, 1] \ Q) = l([0, 1] ∩ Q).

Instead of using subintervals, the Lebesgue integral uses a much wider class of subsets (namely sets in the given σ-algebra) together with a notion of ‘generalised length’ (namely, measure).
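Returning to the Riemann sums defined above: for a Riemann integrable function the upper and lower sums do squeeze onto a common value. A quick numerical sketch of ours (not part of the notes) for f(x) = x^2 on [0, 1] with uniform partitions:

```python
# Upper and lower Riemann sums for f(x) = x^2 on [0, 1] with the uniform
# partition x_i = i/n.  Since f is increasing, the sup over [x_i, x_{i+1}]
# is attained at the right endpoint and the inf at the left endpoint.

def upper_sum(n):
    return sum(((i + 1) / n) ** 2 * (1 / n) for i in range(n))

def lower_sum(n):
    return sum((i / n) ** 2 * (1 / n) for i in range(n))

for n in (10, 100, 1000):
    print(n, lower_sum(n), upper_sum(n))   # both approach 1/3, and U - L = 1/n
```

In contrast, for the characteristic function of the rationals above, U(f, ∆) = 1 and L(f, ∆) = 0 for every partition, so no refinement helps.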

4.6 Integration: The Lebesgue integral

Let (X, B, µ) be a measure space. We are interested in how to integrate functions defined on X with respect to the measure µ. In the special case when X = [0, 1], B is the Borel σ-algebra and µ is Lebesgue measure, this will extend the definition of the Riemann integral to functions that are not Riemann integrable.

Definition 4.13. A function f : X → R is measurable if f^{-1}(D) ∈ B for every Borel subset D of R or, equivalently, if f^{-1}((c, ∞)) ∈ B for all c ∈ R. A function f : X → C is measurable if both the real and imaginary parts, Re f and Im f, are measurable.

We define integration via simple functions.


Definition 4.14. A function f : X → R is simple if it can be written as a linear combination of characteristic functions of sets in B, i.e.:

    f = Σ_{i=1}^r ai χAi,

for some ai ∈ R, Ai ∈ B, where the Ai are pairwise disjoint.

For a simple function f : X → R we define

    ∫ f dµ = Σ_{i=1}^r ai µ(Ai)

(which can be shown to be independent of the representation of f as a simple function). Thus for simple functions, the integral can be thought of as the area underneath the graph.

If f : X → R, f ≥ 0, is measurable then one can show that there exists an increasing sequence of simple functions fn such that fn ↑ f pointwise as n → ∞ (i.e. for every x, fn(x) is an increasing sequence and fn(x) → f(x) as n → ∞) and we define

    ∫ f dµ = lim_{n→∞} ∫ fn dµ.

This can be shown to be independent of the choice of sequence fn. For an arbitrary measurable function f : X → R, we write f = f^+ − f^−, where f^+ = max{f, 0} ≥ 0 and f^− = max{−f, 0} ≥ 0, and define

    ∫ f dµ = ∫ f^+ dµ − ∫ f^− dµ.

Finally, for a measurable function f : X → C, we define

    ∫ f dµ = ∫ Re f dµ + i ∫ Im f dµ.

We say that f is integrable if

    ∫ |f| dµ < +∞.
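The construction via increasing simple functions can be made concrete. Below is our own sketch (not from the notes) for f(x) = x on [0, 1] with Lebesgue measure, using the dyadic simple functions fn = ⌊2^n f⌋/2^n: fn takes the value i/2^n on Ai = [i/2^n, (i+1)/2^n), so ∫ fn dµ = Σ_i (i/2^n)(1/2^n), and these integrals increase to ∫ x dx = 1/2:

```python
from fractions import Fraction

def integral_simple(n):
    """Integral of f_n = floor(2^n x)/2^n against Lebesgue measure on [0, 1)."""
    step = Fraction(1, 2 ** n)        # mu(A_i) = 1/2^n for each dyadic interval
    return sum(Fraction(i, 2 ** n) * step for i in range(2 ** n))

vals = [integral_simple(n) for n in range(1, 11)]
assert all(vals[k] < vals[k + 1] for k in range(len(vals) - 1))   # increasing in n
print(float(vals[-1]))   # close to 1/2, the integral of x over [0, 1]
```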

4.7 Examples

Lebesgue measure. Let X = [0, 1] and let µ denote Lebesgue measure on the Borel σ-algebra. If f : [0, 1] → R is Riemann integrable then it is also Lebesgue integrable and the two definitions agree.

The Stieltjes integral. Let ρ : [0, 1] → R^+ and suppose that ρ is differentiable. Then

    ∫ f dµρ = ∫ f(x) ρ′(x) dx.


Integration with respect to Dirac measures. Let x ∈ X and recall that we defined the Dirac measure δx by δx(A) = 1 if x ∈ A, and δx(A) = 0 if x ∉ A.

If χA denotes the characteristic function of A then

    ∫ χA dδx = 1 if x ∈ A,  and 0 if x ∉ A.

Hence if f = Σ ai χAi is a simple function then ∫ f dδx = ai, where i is the index with x ∈ Ai (and the integral is 0 if x lies in none of the Ai). Now let f : X → R. By choosing an increasing sequence of simple functions, we see that

    ∫ f dδx = f(x).

4.8 The L^p Spaces

Let us say that two measurable functions f, g : X → C are equivalent if f = g µ-a.e. We shall write L^1(X, B, µ) (or L^1(µ)) for the set of equivalence classes of integrable functions on (X, B, µ). We define

    ||f||_1 = ∫ |f| dµ.

Then d(f, g) = ||f − g||_1 is a metric on L^1(X, B, µ). More generally, for any p ≥ 1, we can define the space L^p(X, B, µ) consisting of (equivalence classes of) measurable functions f : X → C such that |f|^p is integrable. We can again define a metric on L^p(X, B, µ) by setting d(f, g) = ||f − g||_p, where

    ||f||_p = ( ∫ |f|^p dµ )^{1/p}.

It is worth remarking that convergence in Lp neither implies nor is implied by convergence almost everywhere. If (X, B, µ) is a finite measure space and if 1 ≤ p < q then

    L^q(X, B, µ) ⊂ L^p(X, B, µ).

Apart from L^1, the most interesting L^p space is L^2(X, B, µ). This is a Hilbert space with the inner product

    ⟨f, g⟩ = ∫ f ḡ dµ.

Remark 4.15. We shall continually abuse notation by saying that, for example, a function f ∈ L^1(X, B, µ) when, strictly speaking, we mean that the equivalence class of f lies in L^1(X, B, µ).

Exercise 4.16. Give an example of a sequence of functions fn ∈ L^1([0, 1], B, µ) (µ = Lebesgue measure) such that fn → 0 µ-a.e. but fn ↛ 0 in L^1.


Exercise 4.17. Give an example to show that

    L^2(R, B, µ) ⊄ L^1(R, B, µ),

where µ is Lebesgue measure.

4.9 Convergence theorems

We state the following two convergence theorems for integration.

Theorem 4.18 (Monotone Convergence Theorem). Suppose that fn : X → R is an increasing sequence of integrable functions on (X, B, µ). If ∫ fn dµ is a bounded sequence of real numbers then lim_{n→∞} fn exists µ-a.e. and is integrable, and

    ∫ lim_{n→∞} fn dµ = lim_{n→∞} ∫ fn dµ.

Theorem 4.19 (Dominated Convergence Theorem). Suppose that g : X → R is integrable and that fn : X → R is a sequence of measurable functions with |fn| ≤ g µ-a.e. and lim_{n→∞} fn = f µ-a.e. Then f is integrable and

    lim_{n→∞} ∫ fn dµ = ∫ f dµ.

Remark 4.20. Both the Monotone Convergence Theorem and the Dominated Convergence Theorem fail for Riemann integration.


5 Measures on Compact Metric Spaces

5.1 The Riesz Representation Theorem

Let X be a compact metric space and let

    C(X, R) = {f : X → R | f is continuous}

denote the space of all continuous functions on X. Equip C(X, R) with the metric

    d(f, g) = ||f − g||_∞ = sup_{x∈X} |f(x) − g(x)|.

Let B denote the Borel σ-algebra on X and let µ be a probability measure on (X, B). Then we can think of µ as a functional that acts on C(X, R), namely

    C(X, R) → R : f ↦ ∫ f dµ.

We will often write µ(f) for ∫ f dµ. Notice that this map enjoys several natural properties:

(i) the functional defined by µ is continuous: i.e. if fn ∈ C(X, R) and fn → f then µ(fn) → µ(f ).

(i′) the functional defined by µ is bounded: i.e. if f ∈ C(X, R) then |µ(f)| ≤ ||f||_∞.

(ii) the functional defined by µ is linear:

µ(λ1f1 + λ2f2) = λ1µ(f1) + λ2µ(f2)

where λ1, λ2 ∈ R and f1, f2 ∈ C(X, R).

(iii) if f ≥ 0 then µ(f) ≥ 0 (i.e. the map µ is positive);

(iv) consider the function 1 defined by 1(x) ≡ 1 for all x; then µ(1) = 1 (i.e. the map µ is normalised).

Exercise 5.1. Prove the above assertions.

Remark 5.2. It can be shown that a linear functional is continuous if and only if it is bounded. Thus, in the presence of (ii), we have that (i) is equivalent to (i′).

The Riesz Representation Theorem says that the above properties characterise all Borel probability measures on X. That is, if we have a map w : C(X, R) → R that satisfies the above four properties, then w must be given by integrating with respect to a Borel probability measure. This will be a very useful method of constructing measures: we need only construct continuous positive normalised linear functionals!

Theorem 5.3 (Riesz Representation Theorem). Let w : C(X, R) → R be a functional such that:


(i) w is bounded: i.e. for all f ∈ C(X, R) we have |w(f)| ≤ ||f||_∞;

(ii) w is linear: i.e. w(λ1f1 + λ2f2) = λ1w(f1) + λ2w(f2);

(iii) w is positive: i.e. if f ≥ 0 then w(f ) ≥ 0;

(iv) w is normalised: i.e. w(1) = 1.

Then there exists a Borel probability measure µ ∈ M(X) such that

    w(f) = ∫ f dµ.

Moreover, µ is unique.

5.2 The space M(X)

In all of the examples that we shall consider, X will be a compact metric space and B will be the Borel σ-algebra. We will also be interested in the space of continuous R-valued functions

C(X, R) = {f : X → R | f is continuous}.

This space is also a metric space. We can define a metric on C(X, R) by first defining

    ||f||_∞ = sup_{x∈X} |f(x)|

and then defining

    ρ(f, g) = ||f − g||_∞.

This metric turns C(X, R) into a complete metric space. (Recall that a metric space is said to be complete if every Cauchy sequence is convergent.) Note also that C(X, R) is a vector space. An important property of C(X, R) that will prove to be useful later on is that it is separable, that is, it contains a countable dense subset.

Rather than fixing one measure on (X, B), it is interesting to consider the totality of possible (probability) measures. To formalise this, let M(X) denote the set of all probability measures on (X, B). The following simple fact will be useful later on.

Proposition 5.4. The space M(X) is convex: if µ1, µ2 ∈ M(X) and 0 ≤ α ≤ 1 then αµ1 + (1 − α)µ2 ∈ M(X).

Exercise 5.5. Prove the above proposition.


5.3 The weak∗ topology on M(X)

It will be very important to have a sensible notion of convergence in M(X); this is called weak∗ convergence. We say that a sequence of probability measures µn weak∗ converges to µ as n → ∞ if, for every f ∈ C(X, R),

    ∫ f dµn → ∫ f dµ,  as n → ∞.
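A minimal numerical illustration of this definition (ours, not part of the notes): the Dirac measures δ_{1/n} weak∗ converge to δ_0, since testing against a continuous f just evaluates f along 1/n → 0. (Note, in passing, that δ_{1/n}({0}) = 0 for every n while δ_0({0}) = 1, so convergence of the measures of a fixed Borel set can fail.)

```python
import math

# mu_n = delta_{1/n}: integrating a continuous f against mu_n evaluates f at
# 1/n, so the integrals converge to f(0), i.e. mu_n -> delta_0 weak*.
# Yet for the Borel set B = {0}: mu_n(B) = 0 for all n, while delta_0(B) = 1.

def integrate_dirac(f, x):
    """Integral of f with respect to the Dirac measure delta_x."""
    return f(x)

f = lambda x: math.exp(-x) * math.cos(x)    # an arbitrary continuous test function
vals = [integrate_dirac(f, 1.0 / n) for n in (1, 10, 100, 1000)]
print(vals)       # the integrals approach f(0)
print(f(0.0))     # 1.0
```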

If µn weak∗ converges to µ then we write µn ⇀ µ. (Note that with this definition it is not necessarily true that µn(B) → µ(B), as n → ∞, for B ∈ B.) We can make M(X) into a metric space compatible with this definition of convergence by choosing a countable dense subset {fn}_{n=1}^∞ ⊂ C(X, R) and, for µ, m ∈ M(X), setting

    d(µ, m) = Σ_{n=1}^∞ (1 / (2^n ||fn||_∞)) | ∫ fn dµ − ∫ fn dm |.

However, we will not need to work with a particular metric: what is important is the definition of convergence.

Notice that there is a continuous embedding of X in M(X) given by the map X → M(X) : x ↦ δx, where δx is the Dirac measure at x: δx(A) = 1 if x ∈ A, and δx(A) = 0 if x ∉ A (so that ∫ f dδx = f(x)).

Exercise 5.6. Show that the map δ : X → M(X) is continuous. (Hint: This is really just unravelling the underlying definitions.)

Exercise 5.7. Let X be a compact metric space. For µ ∈ M(X) define

    ||µ|| = sup_{f∈C(X,R), ||f||_∞≤1} | ∫ f dµ |.

We say that µn converges strongly to µ if ||µn − µ|| → 0 as n → ∞. The topology this determines is called the strong topology (or the operator topology).

(i) Show that if µn → µ strongly then µn ⇀ µ in the weak∗ topology.

(ii) Show that X ↪ M(X) : x ↦ δx is not continuous in the strong topology.

(iii) Prove that ||δx − δy|| = 2 if x ≠ y. (You may use Urysohn’s Lemma: let A and B be disjoint closed subsets of a metric space X. Then there is a continuous function f ∈ C(X, R) such that 0 ≤ f ≤ 1 on X while f ≡ 0 on A and f ≡ 1 on B.) Hence prove that M(X) is not compact in the strong topology when X is infinite.

Exercise 5.8. Give an example of a sequence of measures µn and a set B such that µn ⇀ µ but µn(B) ↛ µ(B).


5.4 M(X) is weak∗ compact

We can use the Riesz Representation Theorem to establish another important property of M(X): that it is compact.

Theorem 5.9. Let X be a compact metric space. Then M(X) is weak∗ compact.

Proof. In fact, we shall show that M(X) is sequentially compact, i.e., that any sequence µn ∈ M(X) has a convergent subsequence. For convenience, we shall write µ(f) = ∫ f dµ.

Since C(X, R) is separable, we can choose a countable dense subset of functions {fi}_{i=1}^∞ ⊂ C(X, R). Given a sequence µn ∈ M(X), we shall first consider the sequence of real numbers µn(f1) ∈ R. We have that |µn(f1)| ≤ ||f1||_∞ for all n, so µn(f1) is a bounded sequence of real numbers. As such, it has a convergent subsequence, µn^(1)(f1) say.

Next we apply the sequence of measures µn^(1) to f2 and consider the sequence µn^(1)(f2) ∈ R. Again, this is a bounded sequence of real numbers and so it has a convergent subsequence µn^(2)(f2).

In this way we obtain, for each i ≥ 1, nested subsequences {µn^(i)} ⊂ {µn^(i−1)} such that µn^(i)(fj) converges for 1 ≤ j ≤ i. Now consider the diagonal sequence µn^(n). Since, for n ≥ i, µn^(n) is a subsequence of µn^(i), the sequence µn^(n)(fi) converges for every i ≥ 1.

We can now use the fact that {fi} is dense to show that µn^(n)(f) converges for all f ∈ C(X, R), as follows. For any ε > 0, we can choose fi such that ||f − fi||_∞ ≤ ε. Since µn^(n)(fi) converges, there exists N such that if n, m ≥ N then

    |µn^(n)(fi) − µm^(m)(fi)| ≤ ε.

Thus if n, m ≥ N we have

    |µn^(n)(f) − µm^(m)(f)| ≤ |µn^(n)(f) − µn^(n)(fi)| + |µn^(n)(fi) − µm^(m)(fi)| + |µm^(m)(fi) − µm^(m)(f)| ≤ 3ε,

so µn^(n)(f) converges, as required.

To complete the proof, write w(f) = lim_{n→∞} µn^(n)(f). We claim that w satisfies the hypotheses of the Riesz Representation Theorem and so corresponds to integration with respect to a probability measure.

(i) By construction, w is a linear mapping: w(λf + µg) = λw(f) + µw(g).

(ii) As |w(f)| ≤ ||f||_∞, we see that w is bounded.

(iii) If f ≥ 0 then it is easy to check that w(f) ≥ 0. Hence w is positive.

(iv) It is easy to check that w is normalised: w(1) = 1.

Therefore, by the Riesz Representation Theorem, there exists µ ∈ M(X) such that w(f) = ∫ f dµ. We then have that ∫ f dµn^(n) → ∫ f dµ, as n → ∞, for all f ∈ C(X, R), i.e., that µn^(n) converges weak∗ to µ, as n → ∞.


6 Measure Preserving Transformations

6.1 Invariant measures

Let (X, B, µ) be a probability space. A transformation T : X → X is said to be measurable if T^{-1}B ∈ B for all B ∈ B.

Definition 6.1. We say that T is a measure-preserving transformation (m.p.t.) or, equivalently, that µ is a T-invariant measure, if µ(T^{-1}B) = µ(B) for all B ∈ B.

Remark 6.2. We write L^1(X, B, µ) for the space of (equivalence classes of) all functions f : X → R that are integrable with respect to the measure µ, i.e.

    L^1(X, B, µ) = { f : X → R | f is measurable and ∫ |f| dµ < ∞ }.

Lemma 6.3. The following are equivalent:

(i) T is a measure-preserving transformation;

(ii) for each f ∈ L^1(X, B, µ), we have

    ∫ f dµ = ∫ f ∘ T dµ.

Proof. Suppose first that (ii) holds. For B ∈ B we have χB ∈ L^1(X, B, µ) and χB ∘ T = χ_{T^{-1}B}, so by (ii)

    µ(B) = ∫ χB dµ = ∫ χB ∘ T dµ = ∫ χ_{T^{-1}B} dµ = µ(T^{-1}B).

This proves one implication.

Conversely, suppose that T is a measure-preserving transformation. For any characteristic function χB, B ∈ B,

    ∫ χB dµ = µ(B) = µ(T^{-1}B) = ∫ χ_{T^{-1}B} dµ = ∫ χB ∘ T dµ,

and so the equality holds for any simple function (a finite linear combination of characteristic functions). Given any f ∈ L^1(X, B, µ) with f ≥ 0, we can find an increasing sequence of simple functions fn with fn → f pointwise, as n → ∞. For each n we have

    ∫ fn dµ = ∫ fn ∘ T dµ

and, applying the Monotone Convergence Theorem to both sides, we obtain

    ∫ f dµ = ∫ f ∘ T dµ.

To extend the result to general real-valued f , consider the positive and negative parts. This completes the proof.
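A concrete instance of Definition 6.1 (a sketch of ours, using the doubling map T(x) = 2x mod 1 from section 1): the preimage of an interval [a, b] ⊂ [0, 1] is the disjoint union of two intervals of half the length, so Lebesgue measure is preserved:

```python
from fractions import Fraction

# For the doubling map T(x) = 2x mod 1, the preimage of [a, b] is
# [a/2, b/2] together with [(a+1)/2, (b+1)/2]; the total length is b - a,
# which is exactly the statement mu(T^{-1}[a, b]) = mu([a, b]).

def preimage_length(a, b):
    """Lebesgue measure of T^{-1}[a, b] for the doubling map, computed exactly."""
    pieces = [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]
    return sum(hi - lo for lo, hi in pieces)

a, b = Fraction(1, 3), Fraction(3, 4)
assert preimage_length(a, b) == b - a
print(preimage_length(a, b))   # 5/12, the length of [1/3, 3/4]
```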


6.2 Continuous transformations

We shall now concentrate on the special case where X is a compact metric space, B is the Borel σ-algebra and T is a continuous mapping (in which case T is measurable). The map T induces a mapping on the set of (Borel) probability measures M(X) as follows:

Definition 6.4. Define the induced mapping T∗ : M(X) → M(X) by

    (T∗µ)(B) = µ(T^{-1}B).

(We call T∗µ the push-forward of µ by T .)

Exercise 6.5. Check that T∗µ is a probability measure.

Then µ is T -invariant if and only if T∗µ = µ. Write

M(X,T ) = {µ ∈ M(X) | T∗µ = µ}.
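For a finitely supported measure, the push-forward of Definition 6.4 is easy to compute: each atom at x moves to T(x) and keeps its mass. A sketch of ours (representing such measures as dicts, a choice that is not part of the notes):

```python
from fractions import Fraction

# Push-forward T_* mu of a finitely supported measure mu on [0, 1),
# represented as a dict {point: mass}.  (T_* mu)(B) = mu(T^{-1}B) means an
# atom of mass m at x becomes an atom of mass m at T(x).

def push_forward(T, mu):
    nu = {}
    for x, mass in mu.items():
        y = T(x)
        nu[y] = nu.get(y, 0) + mass   # atoms landing together combine their mass
    return nu

T = lambda x: (2 * x) % 1                       # doubling map
mu = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}
nu = push_forward(T, mu)
print(nu)   # both atoms land on 1/2, which then carries the whole mass
```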

Lemma 6.6. For f ∈ C(X, R) we have

    ∫ f d(T∗µ) = ∫ f ∘ T dµ.

Proof. From the definition, for B ∈ B,

    ∫ χB d(T∗µ) = ∫ χB ∘ T dµ.

Thus the result also holds for simple functions. If f ∈ C(X, R) is such that f ≥ 0, we can choose an increasing sequence of simple functions fn converging to f pointwise. We have

    ∫ fn d(T∗µ) = ∫ fn ∘ T dµ

and, applying the Monotone Convergence Theorem to each side, we obtain

    ∫ f d(T∗µ) = ∫ f ∘ T dµ.

The result extends to general real-valued f ∈ C(X, R) by considering positive and negative parts.

Lemma 6.7. Let T : X → X be a continuous mapping of a compact metric space. The following are equivalent:

(i) T∗µ = µ;

(ii) for all f ∈ C(X, R),

∫ f dµ = ∫ f ◦ T dµ.


Proof. (i) ⇒ (ii): This follows from Lemma 6.3, since C(X, R) ⊂ L^1(X, B, µ).

(ii) ⇒ (i): Define two linear functionals w_1, w_2 : C(X, R) → R as follows:

w_1(f) = ∫ f dµ,  w_2(f) = ∫ f d(T∗µ).

Note that both w_1 and w_2 are bounded positive normalised linear functionals on C(X, R). Moreover, by Lemma 6.6,

w_2(f) = ∫ f d(T∗µ) = ∫ f ◦ T dµ = ∫ f dµ = w_1(f),

so that w_1 and w_2 determine the same linear functional. By uniqueness in the Riesz Representation Theorem, this implies that T∗µ = µ.

Exercise 6.8. Show that the map T∗ : M(X) → M(X) is continuous in the weak∗ topology.

6.3 Existence of invariant measures

Given a continuous mapping T : X → X of a compact metric space, it is natural to ask whether invariant measures necessarily exist, i.e., whether M(X,T) ≠ ∅. The next result shows that this is the case.

Theorem 6.9. Let T : X → X be a continuous mapping of a compact metric space. Then there exists at least one T-invariant probability measure.

Proof. Let σ ∈ M(X) be a probability measure (for example, we could take σ to be a Dirac measure). Define the sequence µ_n ∈ M(X) by

µ_n = (1/n) Σ_{j=0}^{n−1} T∗^j σ,

so that, for B ∈ B,

µ_n(B) = (1/n) (σ(B) + σ(T^{-1}B) + ··· + σ(T^{-(n-1)}B)).

Since M(X) is weak∗ compact, some subsequence µ_{n_k} converges, as k → ∞, to a measure µ ∈ M(X). We shall show that µ ∈ M(X,T). By Lemma 6.7, this is equivalent to showing that

∫ f dµ = ∫ f ◦ T dµ for all f ∈ C(X, R).

To see this, note that

| ∫ f ◦ T dµ − ∫ f dµ | = lim_{k→∞} | ∫ f ◦ T dµ_{n_k} − ∫ f dµ_{n_k} |
= lim_{k→∞} (1/n_k) | ∫ Σ_{j=0}^{n_k−1} (f ◦ T^{j+1} − f ◦ T^j) dσ |
= lim_{k→∞} (1/n_k) | ∫ (f ◦ T^{n_k} − f) dσ |
≤ lim_{k→∞} 2‖f‖_∞ / n_k = 0.


Therefore, µ ∈ M(X,T ), as required.
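The Cesàro construction in this proof is easy to simulate. The following Python sketch (illustrative, not from the notes; names are my own) takes σ to be the Dirac mass at a point x0 and T the circle rotation of §6.5.1; the telescoping estimate at the end of the proof says that the 'invariance defect' of µ_n is at most 2‖f‖_∞/n.

```python
import math

# A sketch of the Krylov-Bogolyubov construction for the circle
# rotation T(x) = x + alpha mod 1, starting from sigma = delta_{x0}.
# Then mu_n = (1/n) sum_j T^j_* sigma integrates f as a Birkhoff
# average, and the proof's telescoping bound gives
# |int f.T dmu_n - int f dmu_n| <= 2 ||f||_inf / n.

alpha = math.sqrt(2) % 1.0
x0 = 0.1
f = lambda x: math.cos(2 * math.pi * x)  # test function with ||f||_inf = 1

def orbit(x, n):
    pts = []
    for _ in range(n):
        pts.append(x)
        x = (x + alpha) % 1.0
    return pts

n = 10000
pts = orbit(x0, n)
int_f = sum(f(x) for x in pts) / n                   # int f d(mu_n)
int_fT = sum(f((x + alpha) % 1.0) for x in pts) / n  # int f.T d(mu_n)

# In exact arithmetic the defect is exactly (f(T^n x0) - f(x0))/n,
# which is the telescoping sum from the proof, so it is O(1/n):
assert abs(int_fT - int_f) <= 2.0 / n + 1e-9
```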

6.4 Properties of M(X,T)

We now know that M(X,T) ≠ ∅. The next result gives us some basic information about its structure.

Theorem 6.10. (i) M(X,T ) is convex: i.e. µ1, µ2 ∈ M(X,T ) ⇒ αµ1 + (1 − α)µ2 ∈ M(X,T ), for all 0 ≤ α ≤ 1.

(ii) M(X,T ) is weak∗ closed (and hence compact).

Proof. (i) If µ1, µ2 ∈ M(X,T ) and 0 ≤ α ≤ 1 then

(αµ_1 + (1 − α)µ_2)(T^{-1}B) = αµ_1(T^{-1}B) + (1 − α)µ_2(T^{-1}B)
= αµ_1(B) + (1 − α)µ_2(B)
= (αµ_1 + (1 − α)µ_2)(B),

so αµ_1 + (1 − α)µ_2 ∈ M(X,T).

(ii) Let µ_n be a sequence in M(X,T) and suppose that µ_n ⇀ µ ∈ M(X), as n → ∞. For f ∈ C(X, R),

∫ f dµ_n = ∫ f ◦ T dµ_n.

As n → ∞, the left-hand side converges to ∫ f dµ and the right-hand side converges to ∫ f ◦ T dµ. Hence ∫ f dµ = ∫ f ◦ T dµ and so, by Lemma 6.7, µ ∈ M(X,T). This shows that M(X,T) is closed. It is compact since it is a closed subset of the compact set M(X).

6.5 Simple examples

We give two methods by which one can show that a given dynamical system preserves a given measure. We shall illustrate these two methods by proving that (i) a rotation of a torus, and (ii) the doubling map, preserve Lebesgue measure. Let us first recall how these examples are defined.

6.5.1 Rotations on tori

Take X = R^k/Z^k, the k-dimensional torus. Recall that Lebesgue measure µ is defined by first defining the measure of a k-dimensional cube [a_1, b_1] × ··· × [a_k, b_k] to be

µ(∏_{j=1}^k [a_j, b_j]) = ∏_{j=1}^k (b_j − a_j)

and then extending this to the Borel σ-algebra by using the Kolmogorov Extension Theorem. Fix α = (α_1, . . . , α_k) ∈ R^k and define T : X → X by

T(x_1, . . . , x_k) = (x_1 + α_1, . . . , x_k + α_k) mod 1.


(In multiplicative notation this becomes:

T(e^{2πiθ_1}, . . . , e^{2πiθ_k}) = (e^{2πi(θ_1+α_1)}, . . . , e^{2πi(θ_k+α_k)}).)

This is the rotation of the k-dimensional torus R^k/Z^k by the vector (α_1, . . . , α_k). In dimension k = 1 we get a rotation of a circle defined by

T : R/Z → R/Z : x ↦ x + α mod 1.

6.5.2 The doubling map

Let X = R/Z denote the circle. The doubling map is defined to be

T : R/Z → R/Z : x ↦ 2x mod 1.

6.6 Kolmogorov Extension Theorem

Recall the Kolmogorov Extension Theorem:

Theorem 6.11 (Kolmogorov Extension Theorem). Let A be an algebra of subsets of X. Suppose that µ : A → R^+ ∪ {∞} satisfies:

(i) µ(∅) = 0;

(ii) there exist finitely or countably many sets X_n ∈ A such that X = ⋃_n X_n and µ(X_n) < ∞;

(iii) if E_n ∈ A, n ≥ 1, are pairwise disjoint and if ⋃_{n=1}^∞ E_n ∈ A then

µ(⋃_{n=1}^∞ E_n) = Σ_{n=1}^∞ µ(E_n).

Then there is a unique measure µ : B(A) → R^+ ∪ {∞} which is an extension of µ : A → R^+ ∪ {∞}. That is, if something looks like a measure on an algebra A, then it extends uniquely to a measure defined on the σ-algebra B(A) generated by A.

Corollary 6.12. Let A be an algebra of subsets of X. Suppose that µ_1 and µ_2 are two measures on B(A) such that µ_1(E) = µ_2(E) for all E ∈ A. Then µ_1 = µ_2 on B(A).

To show that a dynamical system T preserves a probability measure µ we have to show that T∗µ = µ. By the above corollary, we see that it is sufficient to check that T∗µ = µ on an algebra that generates the σ-algebra.

Recall that the collection of all finite unions of sub-intervals forms an algebra of subsets of both [0, 1] and R/Z that generates the Borel σ-algebra. Similarly, the collection of all finite

unions of k-dimensional sub-cubes of R^k/Z^k forms an algebra of subsets of the k-dimensional torus R^k/Z^k that generates the Borel σ-algebra. Thus to show that a dynamical system T defined on R/Z preserves a measure µ we need only check that

T∗µ(a, b) = µ(T^{-1}(a, b)) = µ(a, b)

for all subintervals (a, b).

6.6.1 Rotations of a circle

We claim that the rotation T(x) = x + α mod 1 preserves Lebesgue measure µ. First note that

T^{-1}(a, b) = {x | T(x) ∈ (a, b)} = (a − α, b − α).

Hence

T∗µ(a, b) = µ(T^{-1}(a, b)) = µ(a − α, b − α) = (b − α) − (a − α) = b − a = µ(a, b).

Hence T∗µ = µ on the algebra of finite unions of subintervals. As this algebra generates the Borel σ-algebra, by uniqueness in the Kolmogorov Extension Theorem we see that T∗µ = µ; i.e. Lebesgue measure is T -invariant.
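The preimage (a − α, b − α) should be read mod 1: it may wrap around 0 and split into two intervals, but its total length is still b − a. A small Python sketch (with hypothetical helper names) checks this:

```python
def preimage_length_rotation(a, b, alpha):
    # Length of T^{-1}(a,b) for T(x) = x + alpha mod 1, computed by
    # shifting the interval back by alpha and reducing mod 1.
    lo, hi = (a - alpha) % 1.0, (b - alpha) % 1.0
    if lo <= hi:
        return hi - lo       # the shifted interval does not wrap
    return (1.0 - lo) + hi   # it wraps around 0: two pieces

alpha = 0.3
for (a, b) in [(0.1, 0.4), (0.0, 0.25), (0.5, 0.95)]:
    # Lebesgue measure of the preimage equals b - a in every case.
    assert abs(preimage_length_rotation(a, b, alpha) - (b - a)) < 1e-12
```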

6.6.2 The doubling map

We claim that the doubling map T(x) = 2x mod 1 preserves Lebesgue measure µ. First note that

T^{-1}(a, b) = {x | T(x) ∈ (a, b)} = (a/2, b/2) ∪ ((a + 1)/2, (b + 1)/2).

Hence

T∗µ(a, b) = µ(T^{-1}(a, b))
= µ((a/2, b/2) ∪ ((a + 1)/2, (b + 1)/2))
= (b/2 − a/2) + ((b + 1)/2 − (a + 1)/2)
= b − a = µ(a, b).

Hence T∗µ = µ on the algebra of finite unions of subintervals. As this algebra generates the Borel σ-algebra, by uniqueness in the Kolmogorov Extension Theorem we see that T∗µ = µ; i.e. Lebesgue measure is T-invariant.
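The two-branch preimage formula can be checked mechanically; the following Python sketch (illustrative, with hypothetical helper names) computes the total length of T^{-1}(a, b) for the doubling map:

```python
def preimage_length_doubling(a, b):
    # T^{-1}(a,b) = (a/2, b/2) U ((a+1)/2, (b+1)/2) for T(x) = 2x mod 1:
    # each of the two branches of T contributes one preimage interval
    # of half the length.
    return (b / 2 - a / 2) + ((b + 1) / 2 - (a + 1) / 2)

for (a, b) in [(0.0, 0.5), (0.2, 0.7), (0.25, 0.26)]:
    # The total length is b - a, as in the computation above.
    assert abs(preimage_length_doubling(a, b) - (b - a)) < 1e-12
```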


6.7 Fourier series

Let B denote the Borel σ-algebra on R/Z and let µ be Lebesgue measure. Given a Lebesgue integrable function f ∈ L^1(R/Z, B, µ), we can associate to f the Fourier series

a_0/2 + Σ_{n=1}^∞ (a_n cos 2πnx + b_n sin 2πnx),

where

a_n = 2 ∫_0^1 f(x) cos 2πnx dµ,  b_n = 2 ∫_0^1 f(x) sin 2πnx dµ.

(Notice that we are not claiming that the series converges—we are just formally associating the Fourier series to f.) We shall find it more convenient to work with a complex form of the Fourier series:

Σ_{n=−∞}^∞ c_n e^{2πinx},

where

c_n = ∫_0^1 f(x) e^{−2πinx} dµ.

(In particular, c_0 = ∫_0^1 f dµ.) We are still not making any assumption as to (i) whether the series converges at all, or (ii) whether, if the series does converge, it converges to f(x). In general, answering these questions relies on the class of function to which f belongs. The weakest class of function is f ∈ L^1(X, B, µ). In this case, we only know that the coefficients c_n → 0 as |n| → ∞. Although this condition is clearly necessary for Σ_{n=−∞}^∞ c_n e^{2πinx} to converge, it is not sufficient, and there exist examples of functions f ∈ L^1(X, B, µ) for which the series does not converge to f(x).

Lemma 6.13 (Riemann–Lebesgue Lemma). If f ∈ L^1(R/Z, B, µ) then c_n → 0 as |n| → ∞, i.e.:

lim_{n→±∞} ∫_0^1 f(x) e^{−2πinx} dµ = 0.

It is of great interest and practical importance to know when and in what sense the Fourier series converges to the original function f. For convenience, we shall write the nth partial sum of a Fourier series as

s_n(x) = Σ_{ℓ=−n}^n c_ℓ e^{2πiℓx}

and the average of the first n partial sums as

σ_n(x) = (1/n)(s_0(x) + s_1(x) + ··· + s_{n−1}(x)).

We define L^2(X, B, µ) to be the set of all functions f : X → R such that ∫ |f|^2 dµ < ∞. Notice that L^2 ⊂ L^1.
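The decay of the coefficients in Lemma 6.13 can be observed numerically. The sketch below (illustrative; a plain Riemann sum, not a rigorous computation) approximates c_n for f(x) = x, whose exact coefficients are c_0 = 1/2 and c_n = i/(2πn) for n ≠ 0, so that |c_n| = 1/(2π|n|) → 0:

```python
import math
import cmath

# Approximate the complex Fourier coefficients
# c_n = int_0^1 f(x) e^{-2 pi i n x} dmu for f(x) = x by a Riemann sum.
# Exact values: c_0 = 1/2 and c_n = i/(2 pi n) for n != 0, so
# |c_n| = 1/(2 pi |n|) -> 0, as the Riemann-Lebesgue Lemma predicts.

N = 4096  # number of equally spaced sample points

def fourier_coeff(f, n):
    return sum(f(k / N) * cmath.exp(-2j * math.pi * n * k / N)
               for k in range(N)) / N

f = lambda x: x
assert abs(fourier_coeff(f, 0) - 0.5) < 1e-3
for n in range(1, 9):
    assert abs(abs(fourier_coeff(f, n)) - 1 / (2 * math.pi * n)) < 1e-4
```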


Theorem 6.14. (i) (Riesz–Fischer Theorem) If f ∈ L^2(R/Z, B, µ) then s_n converges to f in L^2, i.e.,

∫ |s_n − f|^2 dµ → 0, as n → ∞.

(ii) (Fejér’s Theorem) If f ∈ C(R/Z) then σ_n converges uniformly to f as n → ∞, i.e.,

‖σ_n − f‖_∞ → 0, as n → ∞.

In summary:

Class of function | Property of Fourier coefficients      | Fourier series converges to function?
L^1               | c_n → 0                               | Not in general
L^2               | partial sums s_n converge             | Yes, s_n → f (convergence in the L^2 sense)
continuous        | averages σ_n of partial sums converge | Yes, σ_n → f (uniform convergence)

6.7.1 Rotations of a circle

Let T(x) = x + α mod 1 be a circle rotation. We now give an alternative method of proving that µ is T-invariant, using Fourier series. Recall Lemma 6.7: µ is T-invariant if and only if

∫ f ◦ T dµ = ∫ f dµ for all f ∈ C(X, R).

Heuristically, the argument is as follows. First note that

∫ e^{2πinx} dµ = 0 if n ≠ 0, and = 1 if n = 0.

If f ∈ C(X, R) has Fourier series Σ_{n∈Z} c_n e^{2πinx}, then f ◦ T has Fourier series Σ_{n∈Z} c_n e^{2πinα} e^{2πinx}. The underlying idea is the following:

∫ f ◦ T dµ = ∫ Σ_{n∈Z} c_n e^{2πinα} e^{2πinx} dµ
= Σ_{n∈Z} c_n e^{2πinα} ∫ e^{2πinx} dµ
= c_0 = ∫ f dµ.

Notice that the above involves saying that ‘the integral of an infinite sum is the infinite sum of the integrals’. This is not necessarily the case, so to make this argument rigorous we need to use Fejér’s Theorem (Theorem 6.14(ii)) to justify this step. Let f ∈ C(X, R). Then f has a Fourier series

Σ_{n∈Z} c_n e^{2πinx}.


Let sn(x) denote the nth partial sum:

s_n(x) = Σ_{ℓ=−n}^n c_ℓ e^{2πiℓx}.

Then

s_n(Tx) = Σ_{ℓ=−n}^n c_ℓ e^{2πiℓα} e^{2πiℓx}

and this is the nth partial sum for the Fourier series of f ◦ T. As ∫ e^{2πiℓx} dµ = 0 unless ℓ = 0, it follows that

∫ s_n dµ = c_0 = ∫ s_n ◦ T dµ.

Consider

σ_n(x) = (1/n)(s_0 + ··· + s_{n−1})(x).

Then σ_n(x) → f(x) uniformly. Moreover, σ_n(Tx) → f(Tx) uniformly. Hence

∫ f dµ = lim_{n→∞} ∫ σ_n dµ = c_0 = lim_{n→∞} ∫ σ_n ◦ T dµ = ∫ f ◦ T dµ

and Lemma 6.7 implies that Lebesgue measure is invariant.

6.7.2 The doubling map

Define T : X → X by T(x) = 2x mod 1. Heuristically, the argument is as follows: if f has Fourier series Σ_n c_n e^{2πinx}, then f ◦ T has Fourier series Σ_n c_n e^{2πi2nx}. Hence

∫ f ◦ T dµ = ∫ Σ_n c_n e^{2πi2nx} dµ
= Σ_n c_n ∫ e^{2πi2nx} dµ
= c_0
= ∫ f dµ.

Again, this needs to be made rigorous, and the argument is similar to that above.

6.7.3 Higher dimensional Fourier series

Let X = R^k/Z^k be the k-dimensional torus and let µ denote Lebesgue measure on X. Let f ∈ L^1(X, B, µ) be an integrable function defined on the torus. For each n = (n_1, . . . , n_k) ∈ Z^k define

c_n = ∫ f(x) e^{−2πi⟨n,x⟩} dµ


where ⟨n, x⟩ = n_1x_1 + ··· + n_kx_k. Then we can associate to f the Fourier series:

Σ_{n∈Z^k} c_n e^{2πi⟨n,x⟩},

where n = (n_1, . . . , n_k), x = (x_1, . . . , x_k). Essentially the same convergence results hold as in the case k = 1, provided that we write

s_n(x) = Σ_{ℓ_1=−n}^{n} ··· Σ_{ℓ_k=−n}^{n} c_ℓ e^{2πi⟨ℓ,x⟩}.

Exercise 6.15. For an integer k ≥ 2 define T : R/Z → R/Z by T(x) = kx mod 1. Show that T preserves Lebesgue measure.

Exercise 6.16. Let β > 1 denote the golden ratio (so that β^2 = β + 1). Define T : [0, 1] → [0, 1] by T(x) = βx mod 1. Show that T does not preserve Lebesgue measure. Define the measure µ by µ(B) = ∫_B k(x) dx, where

k(x) = 1/(1/β + 1/β^3) on [0, 1/β),  k(x) = 1/(β(1/β + 1/β^3)) on [1/β, 1).

By using the Kolmogorov Extension Theorem, show that T preserves µ.

Exercise 6.17. Define the logistic map T : [0, 1] → [0, 1] by T(x) = 4x(1 − x). Define the measure µ by

µ(B) = (1/π) ∫_B 1/√(x(1 − x)) dx.

(i) Check that µ is a probability measure.

(ii) By using the Kolmogorov Extension Theorem, show that T preserves µ.
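For Exercise 6.17(ii), the cumulative function F(x) = (2/π) arcsin(√x) satisfies µ(a, b) = F(b) − F(a), and each y ∈ (0, 1) has the two preimages (1 ∓ √(1 − y))/2 under T(x) = 4x(1 − x). A Python sketch (illustrative, with hypothetical helper names) of the invariance check:

```python
import math

# Numerically verify that the logistic map T(x) = 4x(1-x) preserves
# mu(B) = (1/pi) int_B dx / sqrt(x(1-x)). The cumulative function
# F(x) = (2/pi) arcsin(sqrt(x)) gives mu(a,b) = F(b) - F(a) exactly.

def F(x):
    return (2 / math.pi) * math.asin(math.sqrt(x))

def mu_preimage(a, b):
    # T^{-1}(a,b) has one piece under each branch x = (1 -+ sqrt(1-y))/2.
    xm = lambda y: (1 - math.sqrt(1 - y)) / 2   # increasing branch
    xp = lambda y: (1 + math.sqrt(1 - y)) / 2   # decreasing branch
    return (F(xm(b)) - F(xm(a))) + (F(xp(a)) - F(xp(b)))

for (a, b) in [(0.1, 0.3), (0.0, 0.5), (0.25, 0.9)]:
    # mu of the preimage agrees with mu of the interval.
    assert abs(mu_preimage(a, b) - (F(b) - F(a))) < 1e-12
```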

6.8 The continued fraction map

Recall that the continued fraction map T : [0, 1) → [0, 1) is defined by

T(x) = 0 if x = 0,  T(x) = 1/x mod 1 if 0 < x < 1.

One can easily show that the continued fraction map does not preserve Lebesgue measure, i.e. there exists B ∈ B such that T^{-1}B and B have different measure. (Indeed, choose B to be any interval.) Although the continued fraction map does not preserve Lebesgue measure, it does preserve Gauss’ measure µ, defined by

µ(B) = (1/log 2) ∫_B 1/(1 + x) dx.


Remark 6.18. Two measures are said to be equivalent if they have the same sets of measure zero. Gauss’ measure and Lebesgue measure are equivalent. This means that any property that holds for µ-almost every point also holds for Lebesgue almost every point. This remark will have applications later when we use Birkhoff’s Ergodic Theorem to describe properties of the continued fraction expansion for typical (i.e. Lebesgue almost every) points.

Proof. Using the Kolmogorov Extension Theorem argument, we only have to check that µ(T^{-1}I) = µ(I) for intervals I. If I = (a, b) then

T^{-1}(a, b) = ⋃_{n=1}^∞ (1/(b + n), 1/(a + n)).

Thus

µ(T^{-1}(a, b)) = (1/log 2) Σ_{n=1}^∞ ∫_{1/(b+n)}^{1/(a+n)} 1/(1 + x) dx
= (1/log 2) Σ_{n=1}^∞ [log(1 + 1/(a + n)) − log(1 + 1/(b + n))]
= (1/log 2) Σ_{n=1}^∞ [log(a + n + 1) − log(a + n) − log(b + n + 1) + log(b + n)]
= (1/log 2) lim_{N→∞} Σ_{n=1}^N [log(a + n + 1) − log(a + n) − log(b + n + 1) + log(b + n)]
= (1/log 2) lim_{N→∞} [log(a + N + 1) − log(a + 1) − log(b + N + 1) + log(b + 1)]
= (1/log 2) [log(b + 1) − log(a + 1) + lim_{N→∞} log((a + N + 1)/(b + N + 1))]
= (1/log 2) (log(b + 1) − log(a + 1))
= (1/log 2) ∫_a^b 1/(1 + x) dx = µ(a, b),

as required.
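The telescoping computation above can be checked numerically: summing the Gauss measures of the preimage intervals (1/(b + n), 1/(a + n)) recovers µ(a, b) up to the O(1/N) tail of the series. A Python sketch (illustrative):

```python
import math

# Check Gauss-measure invariance of the continued fraction map:
# mu(a,b) = (log(1+b) - log(1+a)) / log 2, and T^{-1}(a,b) is the
# disjoint union of the intervals (1/(b+n), 1/(a+n)), n >= 1.

def gauss(a, b):
    return (math.log(1 + b) - math.log(1 + a)) / math.log(2)

def gauss_preimage(a, b, nterms=100000):
    # Partial sum of the series from the proof; the tail is O(1/N).
    return sum(gauss(1 / (b + n), 1 / (a + n)) for n in range(1, nterms + 1))

a, b = 0.2, 0.7
# 10^5 terms give roughly 1e-5 accuracy, well within the tolerance.
assert abs(gauss_preimage(a, b) - gauss(a, b)) < 1e-4
```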

Exercise 6.19. Define the map T : [0, 1] → [0, 1] by

T(x) = x/(1 − x) if 0 ≤ x ≤ 1/2,  T(x) = (1 − x)/x if 1/2 ≤ x ≤ 1.

Define the measure µ on [0, 1] by

µ(B) = ∫_B 1/x dx

(note that the measure µ is not a probability measure as µ([0, 1]) = ∞).


(i) Show that µ([a, b]) = log b − log a.

(ii) Show that

T^{-1}[a, b] = [a/(1 + a), b/(1 + b)] ∪ [1/(1 + b), 1/(1 + a)].

(iii) Show that µ is T -invariant.

(iv) Define h : [0, 1] → [0, ∞] by

h(x) = 1/x − 1.

Define S = hTh^{-1} : [0, ∞] → [0, ∞] (so that S and T are topologically conjugate—i.e. they have the same dynamics). Show that

S(x) = x − 1 if 1 ≤ x < ∞,  S(x) = 1/x − 1 if 0 ≤ x < 1.

Relate the map S to continued fractions.

6.9 Linear toral endomorphisms

Let T : R^k/Z^k → R^k/Z^k be a linear toral endomorphism. Recall that this means that T is given as follows:

T(x_1, . . . , x_k) = A(x_1, . . . , x_k) mod 1,

where A = (a_{i,j}) is a k × k matrix with entries in Z and with det A ≠ 0. We shall show that µ is T-invariant by using Fourier series.

6.9.1 Fourier series in higher dimensions

Let X = R^k/Z^k be the k-dimensional torus and let µ denote Lebesgue measure on X. Let f ∈ L^1(X, B, µ) be an integrable function defined on the torus. For each n = (n_1, . . . , n_k) ∈ Z^k define

c_n = ∫ f(x_1, . . . , x_k) e^{−2πi⟨n,x⟩} dµ

where ⟨n, x⟩ = n_1x_1 + ··· + n_kx_k. Then we can associate to f the Fourier series:

Σ_{n∈Z^k} c_n e^{2πi⟨n,x⟩},

where n = (n_1, . . . , n_k), x = (x_1, . . . , x_k). Essentially the same convergence results hold as in the case k = 1, provided that we write

s_n(x) = Σ_{ℓ_1=−n}^{n} ··· Σ_{ℓ_k=−n}^{n} c_ℓ e^{2πi⟨ℓ,x⟩}.


As in the one-dimensional case, we have that

c_0 = ∫ f dµ

and

∫ e^{2πi⟨n,x⟩} dµ = 1 if n = (0, . . . , 0), and 0 otherwise.

6.9.2 Lebesgue measure is an invariant measure for a toral endomorphism

Let µ denote Lebesgue measure. To show that µ is T-invariant, it is sufficient to prove that for each continuous function f ∈ C(X, R) we have

∫ f ◦ T dµ = ∫ f dµ.

We associate to such an f its Fourier series

Σ_{n∈Z^k} c_n e^{2πi⟨n,x⟩}.

Then f ◦ T has Fourier series

Σ_{n∈Z^k} c_n e^{2πi⟨n,Ax⟩}.

Intuitively, we can write

∫ f ◦ T dµ = ∫ Σ_{n∈Z^k} c_n e^{2πi⟨n,Ax⟩} dµ
= ∫ Σ_{n∈Z^k} c_n e^{2πi⟨nA,x⟩} dµ
= Σ_{n∈Z^k} c_n ∫ e^{2πi⟨nA,x⟩} dµ.

Using the fact that det A ≠ 0, we see that nA = 0 if and only if n = 0. Hence all of the integrals above are 0, except for the term corresponding to n = 0. Hence

∫ f ◦ T dµ = c_0 = ∫ f dµ.

(This argument can be made rigorous as for circle rotations.) Therefore, by Lemma 6.7, µ is T-invariant.

Exercise 6.20. Fix α ∈ R and define the map T : R^2/Z^2 → R^2/Z^2 by

T(x, y) = (x + α, x + y).

By using Fourier series, sketch a proof that Lebesgue measure is T-invariant.


6.10 Shifts of finite type

Let A be a k × k matrix with entries equal to 0 or 1. Recall that we have defined the (two-sided) shift of finite type by

Σ_A = {x = (x_n) ∈ {1, . . . , k}^Z | A(x_n, x_{n+1}) = 1 for all n ∈ Z}

and the (one-sided) shift of finite type

Σ_A^+ = {x = (x_n) ∈ {1, . . . , k}^{Z^+} | A(x_n, x_{n+1}) = 1 for all n ∈ Z^+}.

The shift maps σ : Σ_A → Σ_A, σ : Σ_A^+ → Σ_A^+ are defined by

σ(. . . , x_{−1}, x_0, x_1, x_2, . . .) = (. . . , x_0, x_1, x_2, x_3, . . .) (with the entry in the 0th place moving from x_0 to x_1), and

σ(x_0, x_1, x_2, x_3, . . .) = (x_1, x_2, x_3, . . .),

respectively, i.e., σ shifts sequences one place to the left. As an analogue of intervals in this case, we have so-called ‘cylinder sets’, formed by fixing a finite set of co-ordinates. More precisely, in Σ_A we define

[y_{−m}, . . . , y_{−1}, y_0, y_1, . . . , y_n] = {x ∈ Σ_A | x_i = y_i, −m ≤ i ≤ n},

and in Σ_A^+ we define

[y_0, y_1, . . . , y_n] = {x ∈ Σ_A^+ | x_i = y_i, 0 ≤ i ≤ n}.

In each case the cylinder sets form a semi-algebra which generates the Borel σ-algebra. (Cylinder sets are both open and closed.)

6.10.1 The Perron–Frobenius theorem

The following standard result will be useful.

Theorem 6.21 (Perron–Frobenius). Let B be a non-negative aperiodic k × k matrix (i.e. B_{i,j} ≥ 0 for each 1 ≤ i, j ≤ k and there exists n > 0 such that (B^n)_{i,j} > 0 for all 1 ≤ i, j ≤ k). Then:

(i) there exists a positive eigenvalue λ > 0 such that all other eigenvalues λ_i ∈ C satisfy |λ_i| < λ;

(ii) the eigenvalue λ is simple (i.e. the corresponding eigenspace is one-dimensional);

(iii) there is a unique right-eigenvector v = (v_1, . . . , v_k)^T such that v_j > 0, Σ_{j=1}^k |v_j| = 1 and Bv = λv;

(iv) there is a unique positive left-eigenvector u = (u_1, . . . , u_k) such that u_j > 0, Σ_{j=1}^k |u_j| = 1 and uB = λu;

(v) eigenvectors corresponding to eigenvalues other than λ are not positive: i.e. at least one co-ordinate is positive and at least one co-ordinate is negative.


6.10.2 Markov measures

We will now see how to construct a large class of σ-invariant measures on shifts of finite type.

Definition 6.22. A k × k matrix P is said to be stochastic if:

(i) P(i, j) ≥ 0 for i, j = 1, . . . , k;

(ii) Σ_{j=1}^k P(i, j) = 1 for i = 1, . . . , k.

Suppose that P is compatible with A, i.e.,

P(i, j) > 0 ⟺ A(i, j) = 1.

Suppose in addition that P, or equivalently A, is aperiodic, i.e., there exists n such that for each i, j we have P^n(i, j) > 0. Applying the Perron–Frobenius theorem, we see that there exists a unique maximal eigenvalue λ. As P is stochastic, we must have λ = 1 (why?). Moreover, by (ii) in the above definition, the right-eigenvector is (1, . . . , 1)^T. Let p = (p_1, . . . , p_k) be the corresponding (strictly positive) left eigenvector, normalised so that Σ_{i=1}^k p_i = 1.

Now we define a probability measure µ = µ_P on Σ_A, Σ_A^+ by

µ_P[y_ℓ, y_{ℓ+1}, . . . , y_n] = p_{y_ℓ} P(y_ℓ, y_{ℓ+1}) ··· P(y_{n−1}, y_n)

on cylinder sets. (By the Kolmogorov Extension Theorem, this uniquely defines a measure on the whole Borel σ-algebra.)

We shall show that the measure µ_P on Σ_A^+ is σ-invariant. By the Kolmogorov Extension Theorem, it is enough to show that µ_P and σ∗µ_P agree on cylinder sets. Now

σ∗µ_P[y_0, y_1, . . . , y_n] = µ_P(σ^{-1}[y_0, y_1, . . . , y_n])
= µ_P(⋃_{j=1}^k [j, y_0, y_1, . . . , y_n])
= Σ_{j=1}^k µ_P[j, y_0, y_1, . . . , y_n]
= Σ_{j=1}^k p_j P(j, y_0) P(y_0, y_1) ··· P(y_{n−1}, y_n)
= p_{y_0} P(y_0, y_1) ··· P(y_{n−1}, y_n)
= µ_P[y_0, y_1, . . . , y_n],

as required (where to get the penultimate line we have used the fact that pP = p).

Probability measures of this form are called Markov measures. Given an aperiodic k × k matrix A there are of course many compatible stochastic matrices P. Each such P generates a different Markov measure. However, there are several naturally defined measures that turn out to be Markov, and we give two of them here.
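The invariance computation is easy to replicate for a concrete example. The Python sketch below uses an illustrative 2 × 2 stochastic matrix (not one from the notes), finds its stationary left eigenvector p by iteration, and checks the cylinder identity Σ_j µ_P[j, y_0, . . . , y_n] = µ_P[y_0, . . . , y_n]:

```python
# A sketch of the Markov-measure invariance computation: for a
# stochastic matrix P with stationary left eigenvector p (pP = p),
# mu_P[y_0,...,y_n] = p_{y_0} P(y_0,y_1)...P(y_{n-1},y_n), and summing
# over the extra left-most symbol recovers the original cylinder measure.

P = [[0.5, 0.5],
     [0.8, 0.2]]   # an aperiodic stochastic matrix, chosen for illustration

# Stationary vector by the power method; Perron-Frobenius guarantees
# convergence, and stochasticity keeps the entries summing to 1.
p = [0.5, 0.5]
for _ in range(200):
    p = [sum(p[i] * P[i][j] for i in range(2)) for j in range(2)]

def mu(word):
    m = p[word[0]]
    for i in range(len(word) - 1):
        m *= P[word[i]][word[i + 1]]
    return m

word = [0, 1, 1, 0]
# mu_P(sigma^{-1}[word]) = sum_j mu_P[j, word] = mu_P[word], since pP = p.
assert abs(sum(mu([j] + word) for j in range(2)) - mu(word)) < 1e-12
```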


6.10.3 Full shifts

Recall that if A(i, j) = 1 for all i, j then

Σ_A = {1, . . . , k}^Z,  Σ_A^+ = {1, . . . , k}^{Z^+}

are the full shifts on k symbols. In this case we may define a (family of) measures by taking p = (p_1, . . . , p_k) to be any (positive) probability vector (i.e., p_i > 0, Σ_{i=1}^k p_i = 1) and defining

µ[y_ℓ, . . . , y_n] = p_{y_ℓ} ··· p_{y_n}.

Such a µ is called a Bernoulli measure.

Exercise 6.23. Show that Bernoulli measures are Markov measures (i.e. construct a stochastic matrix P for which (p_1, . . . , p_k)P = (p_1, . . . , p_k)).

6.10.4 The Parry measure

As A is a non-negative aperiodic matrix, by the Perron–Frobenius theorem there exists a unique maximal eigenvalue λ with corresponding left and right eigenvectors u = (u_1, . . . , u_k) and v = (v_1, . . . , v_k)^T, respectively. Define

P_{i,j} = A_{i,j} v_j / (λ v_i)  and  p_i = u_i v_i / c,

where c = Σ_{i=1}^k u_i v_i.

Exercise 6.24. Show that P is a stochastic matrix and that p is a normalised left-eigenvector for P. The corresponding Markov measure is called the Parry measure.
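For a concrete instance, take the golden-mean shift with A = [[1, 1], [1, 0]] (an illustrative choice, not an example worked in the notes): the Perron eigenvalue is the golden ratio and the eigenvectors can be written down explicitly. A Python sketch of Exercise 6.24 for this example:

```python
import math

# The Parry measure construction for the golden-mean shift. Since A is
# symmetric, the left and right Perron eigenvectors coincide. The
# formulas for P and p are invariant under rescaling u and v, so no
# normalisation of the eigenvectors is needed here.

A = [[1, 1],
     [1, 0]]
lam = (1 + math.sqrt(5)) / 2   # Perron eigenvalue: the golden ratio

v = [lam, 1.0]                 # right eigenvector: Av = lam v
u = [lam, 1.0]                 # left eigenvector (A is symmetric)

for i in range(2):             # check the eigenvector equation
    assert abs(sum(A[i][j] * v[j] for j in range(2)) - lam * v[i]) < 1e-12

c = sum(u[i] * v[i] for i in range(2))
Pm = [[A[i][j] * v[j] / (lam * v[i]) for j in range(2)] for i in range(2)]
p = [u[i] * v[i] / c for i in range(2)]

# P is stochastic and p is a normalised left eigenvector: pP = p.
for i in range(2):
    assert abs(sum(Pm[i]) - 1.0) < 1e-12
for j in range(2):
    assert abs(sum(p[i] * Pm[i][j] for i in range(2)) - p[j]) < 1e-12
```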


7 Ergodicity

7.1 The definition of ergodicity

In this section, we introduce what it means to say that a transformation is ergodic with respect to an invariant measure. Ergodicity is an important concept for many reasons, not least because Birkhoff’s Ergodic Theorem holds:

Theorem 7.1. Let T be an ergodic transformation of the probability space (X, B, µ) and let f ∈ L^1(X, B, µ). Then

(1/n) Σ_{j=0}^{n−1} f(T^j x) → ∫ f dµ

for µ-almost every x ∈ X.
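As a numerical illustration (a sketch, assuming the ergodicity of the irrational rotation, which is proved in §7.3): for T(x) = x + √2 mod 1 and f the indicator of [0, 1/2), the Birkhoff average should converge to ∫ f dµ = µ([0, 1/2)) = 1/2.

```python
import math

# Birkhoff averages for the irrational rotation T(x) = x + sqrt(2) mod 1
# with f the indicator of [0, 1/2): the time average of f along the
# orbit should converge to the space average mu([0, 1/2)) = 1/2.

alpha = math.sqrt(2) % 1.0

def birkhoff_average(x, n):
    hits = 0
    for _ in range(n):
        if x < 0.5:
            hits += 1
        x = (x + alpha) % 1.0
    return hits / n

avg = birkhoff_average(0.123, 200000)
assert abs(avg - 0.5) < 0.01
```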

Definition 7.2. Let (X, B, µ) be a probability space and let T : X → X be a measure-preserving transformation. We say that T is an ergodic transformation (or µ is an ergodic measure) if, for B ∈ B,

T^{-1}B = B ⇒ µ(B) = 0 or 1.

Remark 7.3. One can view ergodicity as an indecomposability condition. If ergodicity does not hold and we have T^{-1}A = A with 0 < µ(A) < 1, then one can split T : X → X into T : A → A and T : X \ A → X \ A, with invariant probability measures (1/µ(A)) µ(· ∩ A) and (1/(1 − µ(A))) µ(· ∩ (X \ A)), respectively.

It will sometimes be convenient for us to weaken the condition T^{-1}B = B to µ(T^{-1}B △ B) = 0, where △ denotes the symmetric difference:

A △ B = (A \ B) ∪ (B \ A).

The next lemma allows us to do this.

Lemma 7.4. If B ∈ B satisfies µ(T^{-1}B △ B) = 0 then there exists B_∞ ∈ B with T^{-1}B_∞ = B_∞ and µ(B △ B_∞) = 0. (In particular, µ(B) = µ(B_∞).)

Proof. For each j ≥ 0, we have the inclusion

T^{-j}B △ B ⊂ ⋃_{i=0}^{j−1} (T^{-(i+1)}B △ T^{-i}B) = ⋃_{i=0}^{j−1} T^{-i}(T^{-1}B △ B)

and so (since T preserves µ)

µ(T^{-j}B △ B) ≤ j µ(T^{-1}B △ B) = 0.


Let

B_∞ = ⋂_{j=0}^∞ ⋃_{i=j}^∞ T^{-i}B.

We have that

µ(B △ ⋃_{i=j}^∞ T^{-i}B) ≤ Σ_{i=j}^∞ µ(B △ T^{-i}B) = 0.

Since the sets ⋃_{i=j}^∞ T^{-i}B decrease as j increases, we hence have µ(B △ B_∞) = 0. Also,

T^{-1}B_∞ = ⋂_{j=0}^∞ ⋃_{i=j}^∞ T^{-(i+1)}B = ⋂_{j=0}^∞ ⋃_{i=j+1}^∞ T^{-i}B = B_∞,

as required.

Corollary 7.5. If T is ergodic and µ(T^{-1}B △ B) = 0 then µ(B) = 0 or 1.

Remark 7.6. Occasionally, instead of saying that µ(A △ B) = 0, we will say that A = B a.e. or A = B mod 0.

7.2 An alternative characterisation of ergodicity

The next result characterises ergodicity in a convenient way.

Proposition 7.7. Let T be a measure-preserving transformation of (X, B, µ). The following are equivalent:

(i) T is ergodic;

(ii) whenever f ∈ L^1(X, B, µ) satisfies f ◦ T = f µ-a.e., we have that f is constant µ-a.e.

Remark 7.8. We can replace L^1 in the statement by measurable or by L^2.

Proof. (i) ⇒ (ii): Suppose that T is ergodic and that f ∈ L^1(X, B, µ) with f ◦ T = f µ-a.e. For k ∈ Z and n ∈ N, define

X(k, n) = {x ∈ X | k/2^n ≤ f(x) < (k + 1)/2^n} = f^{-1}([k/2^n, (k + 1)/2^n)).

Since f is measurable, X(k, n) ∈ B. We have that

T^{-1}X(k, n) △ X(k, n) ⊂ {x ∈ X | f(Tx) ≠ f(x)},

so that µ(T^{-1}X(k, n) △ X(k, n)) = 0.


Hence µ(X(k, n)) = 0 or µ(X(k, n)) = 1. For each fixed n, the union ⋃_{k∈Z} X(k, n) is equal to X up to a set of measure zero, i.e.,

µ(X △ ⋃_{k∈Z} X(k, n)) = 0,

and this union is disjoint. Hence we have

Σ_{k∈Z} µ(X(k, n)) = µ(X) = 1

and so there is a unique k_n for which µ(X(k_n, n)) = 1. Let

Y = ⋂_{n=1}^∞ X(k_n, n).

Then µ(Y) = 1 and, by construction, f is constant on Y, i.e., f is constant µ-a.e.

(ii) ⇒ (i): Suppose that B ∈ B with T^{-1}B = B. Then χ_B ∈ L^1(X, B, µ) and χ_B ◦ T(x) = χ_B(x) for all x ∈ X, so, by hypothesis, χ_B is constant µ-a.e. Since χ_B only takes the values 0 and 1, we must have χ_B = 0 µ-a.e. or χ_B = 1 µ-a.e. Therefore

µ(B) = ∫_X χ_B dµ = 0 or 1,

and T is ergodic.

7.3 Rotations of a circle

Fix α ∈ R and define T : R/Z → R/Z by T (x) = x + α mod 1. We have already seen that T preserves Lebesgue measure.

Theorem 7.9. Let T (x) = x + α mod 1.

(i) If α ∈ Q then T is not ergodic.

(ii) If α 6∈ Q then T is ergodic.

Proof. Suppose that α ∈ Q and write α = p/q for p, q ∈ Z with q ≠ 0. Define

f(x) = e^{2πiqx} ∈ L^2(X, B, µ).

Then f is not constant but

f(Tx) = e^{2πiq(x+p/q)} = e^{2πi(qx+p)} = e^{2πiqx} = f(x).

Hence T is not ergodic.


Suppose that α ∉ Q, and suppose that f ∈ L^2(X, B, µ) is such that f ◦ T = f a.e. Suppose that f has Fourier series

Σ_{n=−∞}^∞ c_n e^{2πinx}.

Then f ◦ T has Fourier series

Σ_{n=−∞}^∞ c_n e^{2πinα} e^{2πinx}.

Comparing Fourier coefficients we see that

c_n = c_n e^{2πinα}

for all n ∈ Z. As α ∉ Q, e^{2πinα} ≠ 1 unless n = 0. Hence c_n = 0 for n ≠ 0. Hence f has Fourier series c_0, i.e. f is constant a.e.

Exercise 7.10. Show that, when α ∈ Q, the rotation T(x) = x + α mod 1 is not ergodic directly from the definition, i.e. find an invariant set B = T^{-1}B which has Lebesgue measure 0 < µ(B) < 1.

7.4 The doubling map

Let X = R/Z and define T : X → X by T(x) = 2x mod 1.

Proposition 7.11. The doubling map T is ergodic with respect to Lebesgue measure µ.

Proof. Let f ∈ L^2(R/Z, B, µ) and suppose that f ◦ T = f µ-a.e. Let f have Fourier series

f(x) = Σ_{m=−∞}^∞ a_m e^{2πimx} (in L^2).

For each j ≥ 0, f ◦ T^j has Fourier series

Σ_{m=−∞}^∞ a_m e^{2πim2^j x}.

Comparing Fourier coefficients we see that

a_m = a_{2^j m}

for all m ∈ Z and each j = 0, 1, 2, . . .. The Riemann–Lebesgue Lemma says that a_n → 0 as |n| → ∞. Hence, if m ≠ 0, we have that a_m = a_{2^j m} → 0 as j → ∞, so that a_m = 0. Thus f has Fourier series a_0, and so must be equal to a constant a.e. Hence T is ergodic with respect to µ.
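Combined with Birkhoff’s Ergodic Theorem, this proposition implies that Lebesgue-almost every x is ‘normal’ in base 2: since T^j(x) ∈ [0, 1/2) precisely when the jth binary digit of x is 0, the digits of a typical x are 0 with asymptotic frequency 1/2. Repeated doubling degrades floating-point numbers, so the sketch below (illustrative) models a typical point directly by its random binary digits:

```python
import random

# For the doubling map, T^j(x) lies in [0, 1/2) iff the j-th binary
# digit of x is 0, so the Birkhoff average of the indicator of [0, 1/2)
# is the frequency of 0s among the digits. We model a "typical" x by
# drawing its binary digits at random (seeded for reproducibility).

random.seed(0)
n = 100000
digits = [random.randint(0, 1) for _ in range(n)]  # digits of a random x
freq_zero = digits.count(0) / n
assert abs(freq_zero - 0.5) < 0.01
```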


7.5 Linear toral automorphisms

Let X = R^k/Z^k and let µ denote Lebesgue measure. Let A be a k × k integer matrix with det A = ±1 and define T : X → X by

T(x_1, . . . , x_k) = A(x_1, . . . , x_k) mod 1.

Proposition 7.12. A linear toral automorphism T is ergodic with respect to µ if and only if no eigenvalue of A is a root of unity.

Remark 7.13. In particular, hyperbolic toral automorphisms (i.e. with no eigenvalues of modulus 1) are ergodic with respect to Lebesgue measure.

To prove this criterion, we shall use the following:

Lemma 7.14. Let T be a linear toral automorphism. The following are equivalent:

(i) T is ergodic with respect to µ;

(ii) the only m ∈ Z^k for which there exists p > 0 such that

e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} µ-a.e.

is m = 0.

Proof. (i) ⇒ (ii): Suppose that T is ergodic and that there exist m ∈ Z^k and p > 0 such that e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} µ-a.e. Let p be the least such exponent and define

f(x) = e^{2πi⟨m,x⟩} + e^{2πi⟨m,Ax⟩} + ··· + e^{2πi⟨m,A^{p−1}x⟩}
     = e^{2πi⟨m,x⟩} + e^{2πi⟨mA,x⟩} + ··· + e^{2πi⟨mA^{p−1},x⟩}.

Then f ∈ L^2(R^k/Z^k, B, µ) and, since e^{2πi⟨m,·⟩} ◦ T = e^{2πi⟨m,A·⟩}, we have f ◦ T = f µ-a.e. Since T is ergodic, we thus have f = constant a.e. However, the only way that this can happen is if m = 0.

(ii) ⇒ (i): Now suppose that if, for some m ∈ Z^k and p > 0, we have e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} µ-a.e., then m = 0. Let f ∈ L^2(R^k/Z^k, B, µ) and suppose that f ◦ T = f µ-a.e. We shall show that T is ergodic by showing that f is constant µ-a.e. We may expand f as a Fourier series

f(x) = Σ_{m∈Z^k} a_m e^{2πi⟨m,x⟩} (in L^2).

Since f ◦ T^p = f µ-a.e., for all p > 0, we have

Σ_{m∈Z^k} a_m e^{2πi⟨mA^p,x⟩} = Σ_{m∈Z^k} a_m e^{2πi⟨m,x⟩},


for all p > 0. By the uniqueness of Fourier expansions, we can compare coefficients and obtain, for every m ∈ Z^k,

a_m = a_{mA} = ··· = a_{mA^p} = ··· .

If a_m ≠ 0 then there can only be finitely many distinct indices in the above list, for otherwise it would contradict the fact that a_m → 0 as |m| → ∞. In other words, there exists p > 0 such that m = mA^p. But then e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} and so, by hypothesis, m = 0. Thus a_m = 0 for all m ≠ 0 and so f is equal to the constant a_0 µ-a.e. Therefore T is ergodic.

Proof of Proposition 7.12. We shall prove the contrapositive statements in each case.

First suppose that T is not ergodic. Then, by the Lemma, there exist m ∈ Z^k \ {0} and p > 0 such that e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩}, or, equivalently, e^{2πi⟨mA^p,x⟩} = e^{2πi⟨m,x⟩}, which is to say that mA^p = m. Thus A^p has 1 as an eigenvalue and hence A has a pth root of unity as an eigenvalue.

Now suppose that A has a pth root of unity as an eigenvalue. Then A^p has 1 as an eigenvalue and so m(A^p − I) = 0 for some m ∈ R^k \ {0}. Since A is an integer matrix, we can in fact take m ∈ Z^k \ {0}. We have e^{2πi⟨m,A^p x⟩} = e^{2πi⟨mA^p,x⟩} = e^{2πi⟨m,x⟩}, so, by the Lemma, T is not ergodic.

Exercise 7.15. Define T : R^2/Z^2 → R^2/Z^2 by

T(x, y) = (x + α, x + y).

Suppose that α ∉ Q. By using Fourier series, show that T is ergodic with respect to Lebesgue measure.

Exercise 7.16. (This exercise is outside the scope of the course and so is not examinable.) It is easy to construct lots of examples of hyperbolic toral automorphisms (i.e. with no eigenvalues of modulus 1—the cat map is such an example), which must necessarily be ergodic with respect to Lebesgue measure. It is harder to show that there are ergodic toral automorphisms with some eigenvalues of modulus 1.

(i) Show that to have an ergodic toral automorphism of R^k/Z^k with an eigenvalue of modulus 1, we must have k ≥ 4. Consider the matrix

A = (  0  1  0  0 )
    (  0  0  1  0 )
    (  0  0  0  1 )
    ( −1  8 −6  8 )


(ii) Show that A defines a linear toral automorphism T_A of the 4-dimensional torus R^4/Z^4.

(iii) Show that A has four eigenvalues, two of which have modulus 1.

(iv*) Show that T_A is ergodic with respect to Lebesgue measure. (Hint: you have to show that the two eigenvalues of modulus 1 are not roots of unity, i.e. are not solutions to λ^n − 1 = 0 for some n. The best way to do this is to use results from Galois theory on the irreducibility of polynomials.)

7.6 Existence of ergodic measures

We now return to the general theory of studying the structure of continuous transformations of compact metric spaces. Recall that we have already seen that the space M(X,T) of T-invariant probability measures is always non-empty. We now describe how ergodic measures (for T) fit in to the picture we have developed of M(X,T). We shall then use this to show that, in this case, ergodic measures always exist.

Recall that M(X,T ) is convex: if µ1, µ2 ∈ M(X,T ) then αµ1 + (1 − α)µ2 ∈ M(X,T ) for every 0 ≤ α ≤ 1. A point in a convex set is called an extremal point if it cannot be written as a non-trivial convex combination of (other) elements of the set. More precisely, µ is an extremal point of M(X,T ) if whenever

µ = αµ_1 + (1 − α)µ_2, with µ_1, µ_2 ∈ M(X,T), 0 < α < 1,

then we have µ_1 = µ_2 = µ.

Remarks 7.17. (i) Let Y be the unit square

Y = {(x, y) | 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} ⊂ R^2.

Then the extremal points of Y are the corners (0, 0), (0, 1), (1, 0), (1, 1). (ii) Let Y be the (closed) unit disc

2 2 2 Y = {(x, y) : x + y ≤ 1} ⊂ R .

Then the set of extremal points of Y is precisely the unit circle {(x, y) | x 2 + y 2 = 1}. The next result will allow us to show that ergodic measures for continuous transformations on compact metric spaces always exist.

Theorem 7.18. The following are equivalent:

(i) the T -invariant probability measure µ is ergodic;

(ii) µ is an extremal point of M(X,T ).


Proof. For the moment, we shall only prove (ii) ⇒ (i): if µ is extremal then it is ergodic. In fact, we shall prove the contrapositive. Suppose that µ is not ergodic; we show that µ is not extremal. As µ is not ergodic, there exists B ∈ B such that T^{-1}B = B and 0 < µ(B) < 1. Define probability measures µ1 and µ2 on X by

    µ1(A) = µ(A ∩ B)/µ(B),   µ2(A) = µ(A ∩ (X \ B))/µ(X \ B).

(The condition 0 < µ(B) < 1 ensures that the denominators are not equal to zero.) Clearly, µ1 ≠ µ2, since µ1(B) = 1 while µ2(B) = 0. Since T^{-1}B = B, we also have T^{-1}(X \ B) = X \ B. Thus we have

    µ1(T^{-1}A) = µ(T^{-1}A ∩ B)/µ(B)
                = µ(T^{-1}A ∩ T^{-1}B)/µ(B)
                = µ(T^{-1}(A ∩ B))/µ(B)
                = µ(A ∩ B)/µ(B)
                = µ1(A)

and (by the same argument)

    µ2(T^{-1}A) = µ(T^{-1}A ∩ (X \ B))/µ(X \ B) = µ2(A),

i.e., µ1 and µ2 are both in M(X,T). However, we may write µ as the non-trivial (since 0 < µ(B) < 1) convex combination

    µ = µ(B)µ1 + (1 − µ(B))µ2,

so µ is not extremal. We defer the proof of (i) ⇒ (ii) until later (as an appendix to section 9) as it requires the Radon-Nikodym Theorem, which we have yet to state.

Theorem 7.19. Let T : X → X be a continuous mapping of a compact metric space. Then there exists at least one ergodic measure in M(X,T ).

Proof. By Theorem 7.18, it is equivalent to prove that M(X,T) has an extremal point. Choose a countable dense subset {fi}_{i=0}^∞ of C(X, R). Consider the first function f0. Since the map

    M(X,T) → R : µ ↦ ∫ f0 dµ

is (weak*) continuous and M(X,T) is compact, there exists (at least one) ν ∈ M(X,T) such that

    ∫ f0 dν = sup_{µ∈M(X,T)} ∫ f0 dµ.

If we define

    M0 = { ν ∈ M(X,T) | ∫ f0 dν = sup_{µ∈M(X,T)} ∫ f0 dµ }

then the above shows that M0 is non-empty. Also, M0 is closed and hence compact. We now consider the next function f1 and define

    M1 = { ν ∈ M0 | ∫ f1 dν = sup_{µ∈M0} ∫ f1 dµ }.

By the same reasoning as above, M1 is a non-empty closed subset of M0. Continuing inductively, we define

    Mj = { ν ∈ M_{j−1} | ∫ fj dν = sup_{µ∈M_{j−1}} ∫ fj dµ }

and hence obtain a nested sequence of sets

    M(X,T) ⊃ M0 ⊃ M1 ⊃ · · · ⊃ Mj ⊃ · · ·

with each Mj non-empty and closed. Now consider the intersection

    M∞ = ∩_{j=0}^∞ Mj.

Recall that the countable intersection of a decreasing sequence of non-empty compact sets is non-empty. Hence M∞ is non-empty and we can pick µ∞ ∈ M∞. We shall show that µ∞ is extremal (and hence ergodic).

Suppose that we can write µ∞ = αµ1 + (1 − α)µ2, with µ1, µ2 ∈ M(X,T), 0 < α < 1. We have to show that µ1 = µ2. Since {fj}_{j=0}^∞ is dense in C(X, R), it suffices to show that

    ∫ fj dµ1 = ∫ fj dµ2   ∀ j ≥ 0.

Consider f0. By assumption

    ∫ f0 dµ∞ = α ∫ f0 dµ1 + (1 − α) ∫ f0 dµ2.

In particular,

    ∫ f0 dµ∞ ≤ max { ∫ f0 dµ1, ∫ f0 dµ2 }.

However µ∞ ∈ M0 and so

    ∫ f0 dµ∞ = sup_{µ∈M(X,T)} ∫ f0 dµ ≥ max { ∫ f0 dµ1, ∫ f0 dµ2 }.

Therefore

    ∫ f0 dµ1 = ∫ f0 dµ2 = ∫ f0 dµ∞.

Thus the first identity we require is proved and µ1, µ2 ∈ M0. This last fact allows us to employ the same argument on f1 (with M(X,T) replaced by M0) and conclude that

    ∫ f1 dµ1 = ∫ f1 dµ2 = ∫ f1 dµ∞

and µ1, µ2 ∈ M1. Continuing inductively, we show that for an arbitrary j ≥ 0,

    ∫ fj dµ1 = ∫ fj dµ2

and µ1, µ2 ∈ Mj. This completes the proof.

Exercise 7.20. Define the North-South Map as follows. Let X be the circle of radius 1 centred at (0, 1) in R^2. Call (0, 2) the North Pole (N) and (0, 0) the South Pole (S) of X. Define a map φ : X \ {N} → R × {0} by drawing a straight line through N and x and denoting by φ(x) the unique point on the x-axis that this line crosses (this is just stereographic projection of the circle). Define T : X → X by

    T(x) = φ^{-1}((1/2)φ(x))   if x ∈ X \ {N},
    T(x) = N                   if x = N.

Hence T(N) = N, T(S) = S and if x ≠ N, S then T^n(x) → S as n → ∞.

(i) Show that δS and δN (the Dirac delta measures at S, N, respectively) are T-invariant.

(ii) Show that any invariant measure assigns zero measure to X \ {N, S}. (Hint: take x ≠ N, S and consider the interval I = [x, T(x)). Then ∪_{n∈Z} T^{-n}I is a disjoint union.)

(iii) Hence find all invariant measures for the North-South map.

(iv) Find all ergodic measures for the North-South map.


7.7 Bernoulli Shifts

Let σ : Σk → Σk be the full shift on k symbols. (The following discussion works equally well for the one-sided version Σk^+.) Let p = (p1, . . . , pk) be a probability vector and let µp be the Bernoulli measure determined by p, i.e., on cylinder sets

    µp[z0, . . . , z_{n−1}] = p_{z0} · · · p_{z_{n−1}}.

We shall show that σ is ergodic with respect to µp. We shall use the following fact: given B ∈ B and ε > 0 we can find a finite disjoint collection of cylinder sets C1, . . . , CN such that

    µp( B △ ∪_{j=1}^N Cj ) < ε.

Suppose that B ∈ B satisfies σ^{-1}B = B. Choosing C1, . . . , CN as above and writing E = ∪_{j=1}^N Cj, we have |µp(B) − µp(E)| < ε. The key point is the following: if n is sufficiently large then F = σ^{-n}E and E depend on different co-ordinates. Hence, since µp is defined by a product,

    µp(F ∩ E) = µp(F) µp(E) = µp(σ^{-n}E) µp(E) = µp(E)^2,

since µp is σ-invariant. We also have the estimate

    µp(B △ F) = µp(σ^{-n}B △ σ^{-n}E) = µp(σ^{-n}(B △ E)) = µp(B △ E) < ε.

Since B △ (E ∩ F) ⊂ (B △ E) ∪ (B △ F), we therefore obtain

    µp(B △ (E ∩ F)) ≤ µp(B △ E) + µp(B △ F) < 2ε

and hence

    |µp(B) − µp(E ∩ F)| < 2ε.

Hence we can estimate

    |µp(B) − µp(B)^2| ≤ |µp(B) − µp(E ∩ F)| + |µp(E ∩ F) − µp(B)^2|
                     < 2ε + |µp(E)^2 − µp(B)^2|
                     = 2ε + (µp(E) + µp(B)) |µp(E) − µp(B)|
                     ≤ 4ε,

using µp(E) + µp(B) ≤ 2 and |µp(E) − µp(B)| < ε. Since ε > 0 is arbitrary, we have µp(B) = µp(B)^2. This is only possible if µp(B) = 0 or µp(B) = 1. Therefore σ is ergodic with respect to µp.
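The key point above, that cylinders depending on disjoint co-ordinates have multiplying measures, can be checked directly on finite words. The following sketch does this for one cylinder; the probability vector and the cylinder are arbitrary choices for the demonstration:

```python
# Finite-word check (illustration only) of the independence property: for a
# Bernoulli measure, a cylinder E on coordinates 0..m-1 and its preimage
# sigma^{-n} E depend on disjoint coordinates once n >= m, so measures multiply.
from itertools import product
from math import prod

p = (0.7, 0.3)          # probability vector for the Bernoulli measure
E = (0, 1)              # the cylinder [z0, z1] = [0, 1]
m, n = len(E), 3        # n >= m, so E and sigma^{-n}E use disjoint coordinates
L = n + m               # coordinates 0..L-1 determine both events

def measure(event):
    # mu_p of a set determined by the first L coordinates: sum the product
    # formula p_{w0} ... p_{w_{L-1}} over all admissible words w.
    return sum(prod(p[s] for s in w)
               for w in product(range(len(p)), repeat=L) if event(w))

in_E     = lambda w: w[:m] == E          # w lies in the cylinder E
in_preim = lambda w: w[n:n + m] == E     # w lies in sigma^{-n} E

lhs = measure(lambda w: in_E(w) and in_preim(w))
rhs = measure(in_E) * measure(in_preim)
print(lhs, rhs)   # equal up to rounding: both are mu_p(E)^2
```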


Remark 7.21. For general subshifts of finite type σ : ΣA → ΣA, σ is ergodic with respect to the Markov measure µP if and only if the stochastic matrix P is irreducible (i.e., for each (i, j) there exists n > 0 such that P^n(i, j) > 0).

Exercise 7.22. We have seen that there are lots of (indeed, uncountably many) ergodic measures for the full one-sided two-shift. We can use this fact to show that there are uncountably many ergodic measures for the doubling map.

Let Σ2^+ = {0, 1}^{Z^+} be the one-sided full shift on two symbols. Define π : Σ2^+ → R/Z by

    π(x0, x1, . . .) = x0/2 + x1/2^2 + · · · + xn/2^{n+1} + · · ·.

(i) Show that π is continuous.

(ii) Let T : R/Z → R/Z be the doubling map: T(x) = 2x mod 1. Show that π ◦ σ = T ◦ π. (Thus T is a factor of σ.)

(iii) If µ is a σ-invariant probability measure on Σ2^+, show that π∗µ (where π∗µ(B) = µ(π^{-1}B) for a Borel subset B ⊂ R/Z) is a T-invariant probability measure on R/Z. (Lebesgue measure on R/Z corresponds to choosing µ to be the Bernoulli (1/2, 1/2)-measure on Σ2^+.)

(iv) Show that if µ is an ergodic measure for σ, then π∗µ is an ergodic measure for T.

(v) Conclude that there are uncountably many ergodic measures for the doubling map.
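For part (ii), the identity π ◦ σ = T ◦ π can be verified exactly on finite truncations of a sequence using rational arithmetic. A minimal sketch (the word below is an arbitrary choice):

```python
# Exact check, with rational arithmetic, of the semiconjugacy
# pi . sigma = T . pi from Exercise 7.22 on a finite truncation.
from fractions import Fraction

def pi(word):
    # pi(x0, x1, ...) = sum_i x_i / 2^{i+1}, truncated to the given word
    return sum(Fraction(x, 2 ** (i + 1)) for i, x in enumerate(word))

def T(x):
    # the doubling map T(x) = 2x mod 1 on R/Z
    return (2 * x) % 1

word = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1)
shifted = word[1:]                     # sigma drops the first symbol

print(T(pi(word)) == pi(shifted))      # the diagram commutes
```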

7.8 Remarks on the continued fraction map

Recall that the continued fraction map T : [0, 1) → [0, 1) is defined by

    T(x) = 0            if x = 0,
    T(x) = 1/x mod 1    if 0 < x < 1.

This sends the point

    x = 1/(x0 + 1/(x1 + 1/(x2 + 1/(x3 + · · ·))))

to the point

    T(x) = 1/(x1 + 1/(x2 + 1/(x3 + 1/(x4 + · · ·)))).

That is, T acts by shifting the continued fraction expansion one place to the left (and forgetting the 0th term). Thus we can think of T as a full one-sided subshift, albeit with an infinite number of symbols. Using this analogy, we can then define a cylinder to be a set of the form

    I(x0, x1, . . . , xn) = { x ∈ (0, 1) | x has continued fraction expansion starting x0, x1, . . . , xn }.


Recall that T preserves Gauss’ measure, defined by

    µ(B) = (1/log 2) ∫_B dx/(1 + x).

We claim that µ is an ergodic measure. One proof of this uses similar ideas to the proof that Bernoulli measures for subshifts of finite type are ergodic. However, a crucial property of Bernoulli measures that was used is the following: given two cylinders E and F, we have

    µ(E ∩ σ^{-n}F) = µ(E)µ(F)

provided n is sufficiently large. This equality holds because the formula for the Bernoulli measure of a cylinder is ‘locally constant’, i.e. it depends only on a finite number of co-ordinates. The formula for Gauss’ measure is not locally constant: the function 1/(1 + x) depends on all (i.e. infinitely many) coefficients in the continued fraction expansion of x. However, with some effort, one can prove that there exist constants c, C > 0 such that

    c µ(E)µ(F) ≤ µ(E ∩ σ^{-n}F) ≤ C µ(E)µ(F)

for ‘cylinders’ for the continued fraction map. It turns out that this is sufficient to prove ergodicity. In summary:

Proposition 7.23. Let T denote the continued fraction map. Then T is ergodic with respect to Gauss’ measure.
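The invariance of Gauss’ measure itself is easy to check numerically: T^{-1}[a, b) is the disjoint union over k ≥ 1 of the intervals (1/(k + b), 1/(k + a)], and summing their Gauss measures (truncating the infinite union) recovers the measure of [a, b). A sketch, with an arbitrary test interval:

```python
# Numerical check (truncating the infinite union of preimage intervals)
# that Gauss' measure is invariant under the continued fraction map.
from math import log

def gauss(u, v):
    # Gauss measure of the interval (u, v)
    return (log(1 + v) - log(1 + u)) / log(2)

a, b = 0.2, 0.7                       # an arbitrary test interval [a, b)
preimage_mass = sum(gauss(1 / (k + b), 1 / (k + a)) for k in range(1, 100000))
print(preimage_mass, gauss(a, b))     # agree up to the truncation error
```

The sum in fact telescopes, which is one way to prove invariance by hand.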


8 Recurrence and Unique Ergodicity

8.1 Poincaré’s Recurrence Theorem

We now go back to the general setting of a measure-preserving transformation of a probability space (X, B, µ). The following is the most basic result about the distribution of orbits.

Theorem 8.1 (Poincaré’s Recurrence Theorem). Let T : X → X be a measure-preserving transformation of (X, B, µ) and let A ∈ B have µ(A) > 0. Then for µ-a.e. x ∈ A, the orbit {T^n x}_{n=0}^∞ returns to A infinitely often.

Proof. Let E = { x ∈ A | T^n x ∈ A for infinitely many n }; then we have to show that µ(A \ E) = 0. If we write F = { x ∈ A | T^n x ∉ A ∀ n ≥ 1 } then we have the identity

    A \ E = ∪_{k=0}^∞ (T^{-k}F ∩ A).

Thus we have the estimate

    µ(A \ E) = µ( ∪_{k=0}^∞ (T^{-k}F ∩ A) ) ≤ µ( ∪_{k=0}^∞ T^{-k}F ) ≤ Σ_{k=0}^∞ µ(T^{-k}F).

Since µ(T^{-k}F) = µ(F) ∀ k ≥ 0 (because the measure is preserved), it suffices to show that µ(F) = 0.

First suppose that n > m and that T^{-m}F ∩ T^{-n}F ≠ ∅. If y lies in this intersection then T^m y ∈ F and T^{n−m}(T^m y) = T^n y ∈ F ⊂ A, which contradicts the definition of F. Thus T^{-m}F and T^{-n}F are disjoint.

Since {T^{-k}F}_{k=0}^∞ is a disjoint family, we have

    Σ_{k=0}^∞ µ(T^{-k}F) = µ( ∪_{k=0}^∞ T^{-k}F ) ≤ µ(X) = 1.

Since the terms in the summation have the constant value µ(F), we must have µ(F) = 0.

Exercise 8.2. Construct an example to show that Poincaré’s recurrence theorem does not hold on infinite measure spaces. (Recall that a measure space (X, B, µ) is infinite if µ(X) = ∞.)


8.2 Unique Ergodicity

We finish this section by looking at the case where T : X → X has a unique invariant probability measure.

Definition 8.3. Let (X, B) be a measurable space and let T : X → X be a measurable transformation. If there is a unique T -invariant probability measure then we say that T is uniquely ergodic.

Remark 8.4. You might wonder why we don’t just call such T ‘uniquely invariant’. Recall from Theorem 7.18 that the extremal points of M(X,T) are precisely the ergodic measures. If M(X,T) consists of just one measure then that measure is extremal, and so must be ergodic. Unique ergodicity (for continuous maps) implies the following strong convergence result.

Theorem 8.5. Let X be a compact metric space and let T : X → X be a continuous transformation. The following are equivalent:

(i) T is uniquely ergodic;

(ii) for each f ∈ C(X) there exists a constant c(f ) such that

    (1/n) Σ_{j=0}^{n−1} f(T^j x) → c(f),

uniformly for x ∈ X, as n → ∞.

Proof. (ii) ⇒ (i): Suppose that µ, ν are T-invariant probability measures; we shall show that µ = ν. Integrating the expression in (ii), we obtain

    ∫ f dµ = lim_{n→∞} (1/n) Σ_{j=0}^{n−1} ∫ f ◦ T^j dµ
           = lim_{n→∞} ∫ (1/n) Σ_{j=0}^{n−1} f ◦ T^j dµ
           = ∫ c(f) dµ = c(f),

and, by the same argument,

    ∫ f dν = c(f).

Therefore

    ∫ f dµ = ∫ f dν   ∀ f ∈ C(X)

and so µ = ν (by the Riesz Representation Theorem).


(i) ⇒ (ii): Let M(X,T) = {µ}. If (ii) is true then, by the Dominated Convergence Theorem, we must necessarily have c(f) = ∫ f dµ. Suppose that (ii) is false. Then we can find f ∈ C(X) and sequences nk ∈ N and xk ∈ X such that

    lim_{k→∞} (1/nk) Σ_{j=0}^{nk−1} f(T^j xk) ≠ ∫ f dµ.

For each k ≥ 1, define a measure νk ∈ M(X) by

    νk = (1/nk) Σ_{j=0}^{nk−1} T^j_∗ δ_{xk},

so that

    ∫ f dνk = (1/nk) Σ_{j=0}^{nk−1} f(T^j xk).

By the proof of Theorem 13.1, νk has a subsequence which converges weak* to a measure ν ∈ M(X,T). In particular, we have

    ∫ f dν = lim_{k→∞} ∫ f dνk ≠ ∫ f dµ.

Therefore ν ≠ µ, contradicting unique ergodicity.

8.3 Example: The Irrational Rotation

Let X = R/Z, T : X → X : x ↦ x + α mod 1, with α irrational. Then T is uniquely ergodic (and µ = Lebesgue measure is the unique invariant probability measure).

Proof. Let m be an arbitrary T-invariant probability measure; we shall show that m = µ. Write ek(x) = e^{2πikx}. Then

    ∫ ek(x) dm = ∫ ek(T x) dm = ∫ ek(x + α) dm = e^{2πikα} ∫ ek(x) dm.

Since α is irrational, if k ≠ 0 then e^{2πikα} ≠ 1 and so

    ∫ ek dm = 0.   (4)

Let f ∈ C(X) have Fourier series Σ_{k=−∞}^∞ ak ek, so that a0 = ∫ f dµ. For n ≥ 1, we let σn denote the average of the first n partial sums. Then σn → f uniformly as n → ∞. Hence

    lim_{n→∞} ∫ σn dm = ∫ f dm.

However, using (4), we may calculate that

    ∫ σn dm = a0 = ∫ f dµ.

Thus we have that ∫ f dm = ∫ f dµ for every f ∈ C(X), and so m = µ.
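The uniform convergence in Theorem 8.5 can be seen numerically for the rotation: Birkhoff averages of f(x) = cos(2πx), whose space average is 0, are small for every starting point tried. A sketch (α, f, n and the sample of starting points are all arbitrary choices):

```python
# Birkhoff averages for the irrational rotation are uniformly small for
# f(x) = cos(2 pi x); summing the geometric series bounds them by O(1/n).
from math import cos, pi, sqrt

alpha = sqrt(2) - 1
n = 100000

def birkhoff_average(x, f):
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = (x + alpha) % 1.0
    return total / n

f = lambda x: cos(2 * pi * x)
worst = max(abs(birkhoff_average(x0, f)) for x0 in (0.0, 0.123, 0.5, 0.987))
print(worst)   # small, independently of the starting point
```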


9 Birkhoff’s Ergodic Theorem

9.1 Introduction

An ergodic theorem is a result that describes the limiting behaviour of the sequence

    (1/n) Σ_{j=0}^{n−1} f ◦ T^j   (5)

as n → ∞. The precise formulation of an ergodic theorem depends on the class of function f (for example, one could assume that f is integrable, L^2, or continuous) and the notion of convergence that we use (for example, we could study pointwise convergence, L^2 convergence, or uniform convergence). The result that we are interested in here, Birkhoff’s Ergodic Theorem, deals with pointwise convergence of (5) for an integrable function f.

9.2 Conditional expectation

We will need the concepts of Radon-Nikodym derivatives and conditional expectation.

Definition 9.1. Let µ be a measure on (X, B). We say that a measure ν is absolutely continuous with respect to µ, and write ν ≪ µ, if ν(B) = 0 whenever µ(B) = 0, B ∈ B.

Remark 9.2. Thus ν is absolutely continuous with respect to µ if sets of µ-measure zero also have ν-measure zero (but there may be more sets of ν-measure zero). For example, let f ∈ L^1(X, B, µ) be non-negative and define a measure ν by

    ν(B) = ∫_B f dµ.

Then ν ≪ µ. The following theorem says that, essentially, all absolutely continuous measures occur in this way.

Theorem 9.3 (Radon-Nikodym). Let (X, B, µ) be a probability space. Let ν be a measure defined on B and suppose that ν ≪ µ. Then there is a non-negative measurable function f such that

    ν(B) = ∫_B f dµ,   for all B ∈ B.

Moreover, f is unique in the sense that if g is a measurable function with the same property then f = g µ-a.e.

Exercise 9.4. If ν ≪ µ then it is customary to write dν/dµ for the function given by the Radon-Nikodym theorem, that is,

    ν(B) = ∫_B (dν/dµ) dµ.

Prove the following relations:


(i) If ν ≪ µ and f is a µ-integrable function then

    ∫ f dν = ∫ f (dν/dµ) dµ.

(ii) If ν1, ν2 ≪ µ then

    d(ν1 + ν2)/dµ = dν1/dµ + dν2/dµ.

(iii) If λ ≪ ν ≪ µ then

    dλ/dµ = (dλ/dν)(dν/dµ).

Let A ⊂ B be a sub-σ-algebra. Note that µ defines a measure on A by restriction. Let f ∈ L^1(X, B, µ), with f non-negative. Then we can define a measure ν on A by setting

    ν(A) = ∫_A f dµ.

Note that ν ≪ µ|A. Hence by the Radon-Nikodym theorem, there is a unique A-measurable function E(f | A) such that

    ν(A) = ∫_A E(f | A) dµ.

We call E(f | A) the conditional expectation of f with respect to the σ-algebra A.

So far, we have only defined E(f | A) for non-negative f. To define E(f | A) for an arbitrary f, we split f into positive and negative parts f = f+ − f−, where f+, f− ≥ 0, and define

    E(f | A) = E(f+ | A) − E(f− | A).

Thus we can view conditional expectation as an operator

E(· | A): L1(X, B, µ) → L1(X, A, µ).

Note that E(f | A) is uniquely determined by the two requirements that

(i) E(f | A) is A-measurable, and

(ii) ∫_A f dµ = ∫_A E(f | A) dµ for all A ∈ A.

Intuitively, one can think of E(f | A) as the best approximation to f in the smaller space of all A-measurable functions.

Exercise 9.5. (i) Prove that f ↦ E(f | A) is linear.

(ii) Suppose that g is A-measurable and |g| < ∞ µ-a.e. Show that E(f g | A) = g E(f | A).

(iii) Suppose that T is a measure-preserving transformation. Show that E(f | A) ◦ T = E(f ◦ T | T^{-1}A).

(iv) Show that E(f | B) = f.


(v) Let N denote the trivial σ-algebra consisting of all sets of measure 0 and 1. Show that E(f | N) = ∫ f dµ.

To state Birkhoff’s Ergodic Theorem precisely, we will need the sub-σ-algebra I of T-invariant subsets, namely:

    I = { B ∈ B | T^{-1}B = B a.e. }.

Exercise 9.6. Prove that I is a σ-algebra.

9.3 Birkhoff’s Pointwise Ergodic Theorem

Birkhoff’s Ergodic Theorem deals with the behaviour of (1/n) Σ_{j=0}^{n−1} f(T^j x) for µ-a.e. x ∈ X, and for f ∈ L^1(X, B, µ).

Theorem 9.7 (Birkhoff’s Ergodic Theorem). Let (X, B, µ) be a probability space and let T : X → X be a measure-preserving transformation. Let I denote the σ-algebra of T-invariant sets. Then for every f ∈ L^1(X, B, µ), we have

    (1/n) Σ_{j=0}^{n−1} f(T^j x) → E(f | I)

for µ-a.e. x ∈ X.

Corollary 9.8. Let (X, B, µ) be a probability space and let T : X → X be an ergodic measure-preserving transformation. Let f ∈ L^1(X, B, µ). Then

    (1/n) Σ_{j=0}^{n−1} f(T^j x) → ∫ f dµ, as n → ∞,

for µ-a.e. x ∈ X.

Proof. If T is ergodic then I is the trivial σ-algebra N consisting of sets of measure 0 and 1. If f ∈ L^1(X, B, µ) then E(f | N) = ∫ f dµ. The result follows from the general version of Birkhoff’s ergodic theorem.
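Corollary 9.8 can be illustrated numerically with the logistic map x ↦ 4x(1 − x), an example not treated in these notes, which is known to be ergodic for the measure on [0, 1] with density 1/(π√(x(1 − x))); for f(x) = x the space average is ∫ f dµ = 1/2, and the time average along a typical orbit agrees:

```python
# Time average of f(x) = x along a logistic-map orbit approaches the space
# average 1/2 (a numerical illustration of Birkhoff's theorem, not a proof).
x = 0.1234          # an arbitrary (typical) starting point
n = 1000000
total = 0.0
for _ in range(n):
    total += x
    x = 4.0 * x * (1.0 - x)
avg = total / n
print(avg)   # close to 1/2
```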

9.4 The proof of Birkhoff’s Ergodic Theorem

The proof is something of a tour de force of hard analysis. It is based on the following inequality.

Theorem 9.9 (Maximal Inequality). Let (X, B, µ) be a probability space, let T : X → X be a measure-preserving transformation and let f ∈ L^1(X, B, µ). Define f0 = 0 and, for n ≥ 1,

    fn = f + f ◦ T + · · · + f ◦ T^{n−1}.


For n ≥ 1, set

    Fn = max_{0≤j≤n} fj

(so that Fn ≥ 0). Then

    ∫_{{x | Fn(x)>0}} f dµ ≥ 0.

Proof. Clearly Fn ∈ L^1(X, B, µ). For 0 ≤ j ≤ n, we have Fn ≥ fj, so Fn ◦ T ≥ fj ◦ T. Hence

    Fn ◦ T + f ≥ fj ◦ T + f = f_{j+1}

and therefore

    Fn ◦ T(x) + f(x) ≥ max_{1≤j≤n} fj(x).

If Fn(x) > 0 then

    max_{1≤j≤n} fj(x) = max_{0≤j≤n} fj(x) = Fn(x),

so we obtain that

    f ≥ Fn − Fn ◦ T

on the set A = {x | Fn(x) > 0}. Hence

    ∫_A f dµ ≥ ∫_A Fn dµ − ∫_A Fn ◦ T dµ
            = ∫_X Fn dµ − ∫_A Fn ◦ T dµ
            ≥ ∫_X Fn dµ − ∫_X Fn ◦ T dµ
            = 0,

where we have used

(i) Fn = 0 on X \ A,

(ii) Fn ◦ T ≥ 0,

(iii) µ is T-invariant.

Corollary 9.10. If g ∈ L^1(X, B, µ) and if

    Bα = { x ∈ X | sup_{n≥1} (1/n) Σ_{j=0}^{n−1} g(T^j x) > α }

then for all A ∈ B with T^{-1}A = A we have that

    ∫_{Bα∩A} g dµ ≥ α µ(Bα ∩ A).


Proof. Suppose first that A = X. Let f = g − α; then

    Bα = ∪_{n=1}^∞ { x | Σ_{j=0}^{n−1} g(T^j x) > nα }
       = ∪_{n=1}^∞ { x | fn(x) > 0 }
       = ∪_{n=1}^∞ { x | Fn(x) > 0 }

(since fn(x) > 0 ⇒ Fn(x) > 0, and Fn(x) > 0 ⇒ fj(x) > 0 for some 1 ≤ j ≤ n). Write

    Cn = { x | Fn(x) > 0 }

and observe that Cn ⊂ C_{n+1}. Thus χ_{Cn} converges to χ_{Bα} and so f χ_{Cn} converges to f χ_{Bα}, as n → ∞. Furthermore, |f χ_{Cn}| ≤ |f|. Hence, by the Dominated Convergence Theorem,

    ∫_{Cn} f dµ = ∫_X f χ_{Cn} dµ → ∫_X f χ_{Bα} dµ = ∫_{Bα} f dµ, as n → ∞.

Applying the maximal inequality, we have, for all n ≥ 1,

    ∫_{Cn} f dµ ≥ 0.

Therefore

    ∫_{Bα} f dµ ≥ 0,

i.e.,

    ∫_{Bα} g dµ ≥ α µ(Bα).

For the general case, we work with the restriction of T to A, T : A → A, and apply the maximal inequality on this subset to get

    ∫_{Bα∩A} g dµ ≥ α µ(Bα ∩ A),

as required.

Proof of Birkhoff’s Ergodic Theorem. Let

    f^*(x) = lim sup_{n→∞} (1/n) Σ_{j=0}^{n−1} f(T^j x)

and

    f_*(x) = lim inf_{n→∞} (1/n) Σ_{j=0}^{n−1} f(T^j x).


Writing

    an(x) = (1/n) Σ_{j=0}^{n−1} f(T^j x),

observe that

    ((n + 1)/n) a_{n+1}(x) = an(T x) + (1/n) f(x).

Taking the lim sup and lim inf as n → ∞ gives us that f^* ◦ T = f^* and f_* ◦ T = f_*. We have to show

(i) f^* = f_* µ-a.e.,

(ii) f^* ∈ L^1(X, B, µ),

(iii) ∫ f^* dµ = ∫ f dµ.

We prove (i). For α, β ∈ R, define

    Eα,β = { x ∈ X | f_*(x) < β and f^*(x) > α }.

Note that

    { x ∈ X | f_*(x) < f^*(x) } = ∪_{β<α, α,β∈Q} Eα,β

(a countable union). Thus, to show that f^* = f_* µ-a.e., it suffices to show that µ(Eα,β) = 0 whenever β < α. Since f_* ◦ T = f_* and f^* ◦ T = f^*, we see that T^{-1}Eα,β = Eα,β. If we write

    Bα = { x ∈ X | sup_{n≥1} (1/n) Σ_{j=0}^{n−1} f(T^j x) > α }

then Eα,β ∩ Bα = Eα,β. Applying Corollary 9.10 we have that

    ∫_{Eα,β} f dµ = ∫_{Eα,β∩Bα} f dµ ≥ α µ(Eα,β ∩ Bα) = α µ(Eα,β).

Replacing f, α and β by −f, −β and −α, and using the fact that (−f)^* = −f_* and (−f)_* = −f^*, we also get

    ∫_{Eα,β} f dµ ≤ β µ(Eα,β).

Therefore

    α µ(Eα,β) ≤ β µ(Eα,β),

and since β < α this shows that µ(Eα,β) = 0. Thus f^* = f_* µ-a.e. and

    lim_{n→∞} (1/n) Σ_{j=0}^{n−1} f(T^j x) = f^*(x) µ-a.e.


We prove (ii). Let

    gn = (1/n) Σ_{j=0}^{n−1} |f ◦ T^j|.

Then gn ≥ 0 and

    ∫ gn dµ ≤ ∫ |f| dµ,

so we can apply Fatou’s Lemma to conclude that lim inf_{n→∞} gn is integrable; since |f^*| ≤ lim inf_{n→∞} gn, it follows that f^* ∈ L^1(X, B, µ).

We prove (iii). For n ∈ N and k ∈ Z, define

    D^n_k = { x ∈ X | k/n ≤ f^*(x) < (k + 1)/n }.

For every ε > 0, we have that

    D^n_k ∩ B_{k/n−ε} = D^n_k.

Since T^{-1}D^n_k = D^n_k, we can apply Corollary 9.10 again to obtain

    ∫_{D^n_k} f dµ ≥ (k/n − ε) µ(D^n_k).

Since ε > 0 is arbitrary, we have

    ∫_{D^n_k} f dµ ≥ (k/n) µ(D^n_k).

Thus

    ∫_{D^n_k} f^* dµ ≤ ((k + 1)/n) µ(D^n_k) ≤ (1/n) µ(D^n_k) + ∫_{D^n_k} f dµ

(where the first inequality follows from the definition of D^n_k). Since

    X = ∪_{k∈Z} D^n_k

(a disjoint union), summing over k ∈ Z gives

    ∫_X f^* dµ ≤ (1/n) µ(X) + ∫_X f dµ = 1/n + ∫_X f dµ.

Since this holds for all n ≥ 1, we obtain

    ∫_X f^* dµ ≤ ∫_X f dµ.

Applying the same argument to −f gives

    ∫ (−f)^* dµ ≤ ∫ −f dµ,

so that

    ∫ f^* dµ = ∫ f_* dµ ≥ ∫ f dµ.

Therefore

    ∫ f^* dµ = ∫ f dµ,

as required.

Finally, we prove that f^* = E(f | I). First note that, as f^* is T-invariant, it is measurable with respect to I. Moreover, if I is any T-invariant set then

    ∫_I f dµ = ∫_I f^* dµ.

Hence f^* = E(f | I).

9.5 Consequences of the Ergodic Theorem

Here we give some simple corollaries of Birkhoff’s Ergodic Theorem. The first result says that, for a typical orbit of an ergodic dynamical system, ‘time averages’ equal ‘space averages’.

Corollary 9.11. If T is ergodic and if B ∈ B then for µ-a.e. x ∈ X the frequency with which the orbit of x lies in B is given by µ(B), i.e.,

    lim_{n→∞} (1/n) card{ j ∈ {0, 1, . . . , n − 1} | T^j x ∈ B } = µ(B) µ-a.e.

Proof. Apply the Birkhoff Ergodic Theorem with f = χB.

It is possible to characterise ergodicity in terms of the behaviour of sets, rather than points, under iteration. The next result deals with this.

Theorem 9.12. Let (X, B, µ) be a probability space and let T : X → X be a measure-preserving transformation. The following are equivalent:

(i) T is ergodic;

(ii) for all A, B ∈ B,

    (1/n) Σ_{j=0}^{n−1} µ(T^{-j}A ∩ B) → µ(A)µ(B),

as n → ∞.


Proof. (i) ⇒ (ii): Suppose that T is ergodic. Since χA ∈ L^1(X, B, µ), Birkhoff’s Ergodic Theorem tells us that

    (1/n) Σ_{j=0}^{n−1} χA ◦ T^j → µ(A), as n → ∞,

µ-a.e. Multiplying both sides by χB gives

    (1/n) Σ_{j=0}^{n−1} (χA ◦ T^j) χB → µ(A) χB, as n → ∞,

µ-a.e. Since the left-hand side is bounded (by 1), we can apply the Dominated Convergence Theorem to see that

    (1/n) Σ_{j=0}^{n−1} µ(T^{-j}A ∩ B) = (1/n) Σ_{j=0}^{n−1} ∫ (χA ◦ T^j) χB dµ
                                      = ∫ (1/n) Σ_{j=0}^{n−1} (χA ◦ T^j) χB dµ → µ(A)µ(B),

as n → ∞.

(ii) ⇒ (i): Now suppose that the convergence holds. Suppose that T^{-1}A = A and take B = A. Then µ(T^{-j}A ∩ B) = µ(A), so

    (1/n) Σ_{j=0}^{n−1} µ(A) → µ(A)^2,

as n → ∞. This gives µ(A) = µ(A)^2. Therefore µ(A) = 0 or 1 and so T is ergodic.
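Condition (ii) can be watched numerically for the irrational rotation, which is ergodic but not mixing: the individual terms µ(T^{-j}A ∩ B) oscillate in j, yet their Cesàro averages settle at µ(A)µ(B). In the sketch below the sets A, B are intervals (arbitrary choices), for which each term can be computed exactly:

```python
# Cesaro averages of mu(T^{-j}A cap B) for the rotation tend to mu(A)mu(B).
from math import sqrt

alpha = sqrt(2) - 1

def overlap(s, length, c, d):
    # Lebesgue measure of ([s, s + length) mod 1) intersected with [c, d)
    s %= 1.0
    pieces = [(s, min(s + length, 1.0))]
    if s + length > 1.0:
        pieces.append((0.0, s + length - 1.0))
    return sum(max(0.0, min(v, d) - max(u, c)) for u, v in pieces)

a_len, c, d = 0.3, 0.2, 0.7        # A = [0, 0.3), B = [0.2, 0.7)
n = 20000
# T^{-j}A = A - j*alpha (mod 1), so mu(T^{-j}A cap B) = overlap(-j*alpha, ...)
avg = sum(overlap(-j * alpha, a_len, c, d) for j in range(n)) / n
print(avg)   # near mu(A) mu(B) = 0.3 * 0.5 = 0.15
```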

9.6 Normal numbers

A number x ∈ [0, 1) is called normal (in base 2) if it has a unique binary expansion, the digit 0 occurs in its binary expansion with frequency 1/2, and the digit 1 occurs in its binary expansion with frequency 1/2. We will show that Lebesgue a.e. x ∈ [0, 1) is normal.

To see this, observe that Lebesgue almost every x ∈ [0, 1) has a unique binary expansion x = ·x1x2 . . ., xi ∈ {0, 1}. Define T x = 2x mod 1. Then xn = 0 if and only if T^{n−1}x ∈ [0, 1/2). Thus

    (1/n) card{ 1 ≤ i ≤ n | xi = 0 } = (1/n) Σ_{i=0}^{n−1} χ_{[0,1/2)}(T^i x).

Since T is ergodic (with respect to Lebesgue measure), for Lebesgue almost every point x the above expression converges to ∫ χ_{[0,1/2)}(x) dx = 1/2. Similarly the frequency with which the digit 1 occurs is equal to 1/2. Hence Lebesgue almost every point in [0, 1) is normal.

Exercise 9.13. (i) Let r ≥ 2. What would it mean to say that a number x ∈ [0, 1) is normal in base r?


(ii) Prove that for each r, Lebesgue a.e. x ∈ [0, 1) is normal in base r.

(iii) Conclude that Lebesgue a.e. x ∈ [0, 1) is simultaneously normal in every base r = 2, 3, 4, . . ..

Exercise 9.14. Prove that the arithmetic mean of the digits appearing in the base 10 expansion of Lebesgue-a.e. x ∈ [0, 1) is equal to 4.5, i.e. prove that if x = Σ_{j=0}^∞ xj/10^{j+1}, xj ∈ {0, 1, . . . , 9}, then

    lim_{n→∞} (1/n)(x0 + x1 + · · · + x_{n−1}) = 4.5 a.e.
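The normal-numbers statement is easy to probe by Monte Carlo: choosing x ∈ [0, 1) uniformly amounts to choosing its binary digits by independent fair coin flips, so we can sample the digits directly (the seed is fixed only to make the run reproducible; n is arbitrary):

```python
# Frequency of the digit 0 among the first n binary digits of a
# Lebesgue-random point: close to 1/2, as normality predicts.
import random

random.seed(0)
n = 100000
digits = [random.getrandbits(1) for _ in range(n)]
freq_zero = digits.count(0) / n
print(freq_zero)   # close to 1/2
```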

9.7 Continued fractions

We will show that for Lebesgue a.e. x ∈ (0, 1) the frequency with which the digit k occurs in the continued fraction expansion of x is given by

    (1/log 2) log( (k + 1)^2 / (k(k + 2)) ).

Let λ denote Lebesgue measure and let µ denote Gauss’ measure. Then λ-a.e. and µ-a.e. x ∈ (0, 1) is irrational and has an infinite continued fraction expansion

    x = 1/(x1 + 1/(x2 + 1/(x3 + 1/(x4 + · · ·)))).

Let T denote the continued fraction map. Then xn = [1/T^{n−1}x]. Fix k ∈ N. Then xn = k precisely when [1/T^{n−1}x] = k, i.e.

    k ≤ 1/T^{n−1}x < k + 1,

which is equivalent to requiring

    1/(k + 1) < T^{n−1}x ≤ 1/k.

Hence

    (1/n) card{ 1 ≤ i ≤ n | xi = k } = (1/n) Σ_{i=0}^{n−1} χ_{(1/(k+1),1/k]}(T^i x)
        → ∫ χ_{(1/(k+1),1/k]} dµ   for µ-a.e. x
        = (1/log 2) ( log(1 + 1/k) − log(1 + 1/(k + 1)) )
        = (1/log 2) log( (k + 1)^2 / (k(k + 2)) ).

As µ and λ are equivalent, this holds for Lebesgue almost every point.
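As a sanity check, the limiting frequencies are positive and sum to 1 over k: the partial sums telescope to log₂(2(K + 1)/(K + 2)) → 1. A short sketch:

```python
# The Gauss-Kuzmin frequencies: about 41.5% of continued fraction digits
# are 1, and the frequencies over all k sum (telescopically) to 1.
from math import log

def cf_freq(k):
    # limiting frequency of the digit k in a typical continued fraction
    return log((k + 1) ** 2 / (k * (k + 2))) / log(2)

partial = sum(cf_freq(k) for k in range(1, 10001))
print(cf_freq(1), partial)   # about 0.4150 and about 1
```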


Exercise 9.15. (i) Deduce from Birkhoff’s Ergodic Theorem that if T is an ergodic measure-preserving transformation of a probability space (X, B, µ) and f ≥ 0 is measurable but ∫ f dµ = ∞ then

    (1/n) Σ_{j=0}^{n−1} f(T^j x) → ∞ µ-a.e.

(Hint: define fM = min{f, M} and note that fM ∈ L^1(X, B, µ). Apply Birkhoff’s Ergodic Theorem to each fM.)

(ii) For x ∈ (0, 1) \ Q write its infinite continued fraction expansion as

    x = 1/(x1 + 1/(x2 + 1/(x3 + · · ·))).

Show that for Lebesgue almost every x ∈ (0, 1) we have

    (1/n)(x1 + x2 + · · · + xn) → ∞

as n → ∞. (That is, for a typical point x, the average value of the coefficients in its continued fraction expansion is infinite.)

9.8 Appendix

Completion of the proof of Theorem 7.18. Suppose that µ is ergodic and that µ = αµ1 + (1 − α)µ2, with µ1, µ2 ∈ M(X,T) and 0 < α < 1. We shall show that µ1 = µ (so that µ2 = µ, also), i.e., that µ is extremal.

If µ(A) = 0 then µ1(A) = 0, so µ1 ≪ µ. Therefore the Radon-Nikodym derivative dµ1/dµ ≥ 0 exists. One can easily deduce from the statement of the Radon-Nikodym Theorem that µ1 = µ if and only if dµ1/dµ = 1 µ-a.e. We shall show that this is indeed the case by showing that the sets where, respectively, dµ1/dµ < 1 and dµ1/dµ > 1 both have µ-measure zero.

Let

    B = { x ∈ X : (dµ1/dµ)(x) < 1 }.

We have that

    ∫_{B∩T^{-1}B} (dµ1/dµ) dµ + ∫_{B\T^{-1}B} (dµ1/dµ) dµ = ∫_B (dµ1/dµ) dµ   (*)
        = µ1(B) = µ1(T^{-1}B)
        = ∫_{T^{-1}B} (dµ1/dµ) dµ
        = ∫_{B∩T^{-1}B} (dµ1/dµ) dµ + ∫_{T^{-1}B\B} (dµ1/dµ) dµ.


Comparing the first and last terms, we see that

    ∫_{B\T^{-1}B} (dµ1/dµ) dµ = ∫_{T^{-1}B\B} (dµ1/dµ) dµ.

In fact, these integrals are taken over sets of the same µ-measure:

    µ(T^{-1}B \ B) = µ(T^{-1}B) − µ(T^{-1}B ∩ B) = µ(B) − µ(T^{-1}B ∩ B) = µ(B \ T^{-1}B).

However, on the LHS of (*) the integrand satisfies dµ1/dµ < 1 and on the RHS of (*) the integrand satisfies dµ1/dµ ≥ 1. Thus we conclude that µ(B \ T^{-1}B) = µ(T^{-1}B \ B) = 0, which is to say that µ(T^{-1}B △ B) = 0. Therefore (since T is ergodic) by Corollary 8.5, µ(B) = 0 or µ(B) = 1.

We can rule out the possibility that µ(B) = 1 by observing that if µ(B) = 1 then

    1 = µ1(X) = ∫_X (dµ1/dµ) dµ = ∫_B (dµ1/dµ) dµ < µ(B) = 1,

a contradiction. Therefore µ(B) = 0.

If we define

    C = { x ∈ X : (dµ1/dµ)(x) > 1 }

then repeating essentially the same argument gives µ(C) = 0. Hence

    µ{ x ∈ X : (dµ1/dµ)(x) = 1 } = µ(X \ (B ∪ C)) = µ(X) − µ(B) − µ(C) = 1,

i.e., dµ1/dµ = 1 µ-a.e. Therefore µ1 = µ, as required.


10 Entropy

10.1 The Classification Problem

The classification problem is to decide when two measure-preserving transformations are ‘the same’. We say that two measure-preserving transformations are ‘the same’ if they are (measure-theoretically) isomorphic.

Definition 10.1. We say that two measure-preserving transformations (X, B, µ, T) and (Y, C, m, S) are (measure-theoretically) isomorphic if there exist M ∈ B and N ∈ C such that

(i) T M ⊂ M, SN ⊂ N,

(ii) µ(M) = 1, m(N) = 1,

and there exists a bijection φ : M → N such that

(i) φ, φ^{-1} are measurable and measure-preserving,

(ii) φ ◦ T = S ◦ φ.

Remark 10.2. We often say ‘metrically isomorphic’ in place of ‘measure-theoretically isomorphic’. Here, ‘metric’ is a contraction of ‘measure-theoretic’; it has no connection with metric spaces!

How can we decide whether two measure-preserving transformations are isomorphic? A partial answer is given by looking for isomorphism invariants. The most important and successful invariant is the (measure theoretic) entropy. This is a non-negative real number that characterizes the complexity of the measure-preserving transformation. It was introduced by Kolmogorov and Sinai in 1958/59 and immediately solved the outstanding open problem in the subject: whether, for example,

    S : R/Z → R/Z : x ↦ 2x mod 1,
    T : R/Z → R/Z : x ↦ 3x mod 1

are isomorphic. (The invariant measure is Lebesgue in each case.) The answer is no, since the systems have different entropies (log 2 and log 3, respectively).

10.2 Conditional expectation

Recall that if A ⊂ B is a σ-algebra then we define the operator

    E(· | A) : L^1(X, B, µ) → L^1(X, A, µ)

such that if f ∈ L^1(X, B, µ) then

(i) E(f | A) is A-measurable, and

(ii) for all A ∈ A, ∫_A E(f | A) dµ = ∫_A f dµ.


Definition 10.3. Let A ⊂ B be a σ-algebra. We define the conditional probability of B ∈ B given A to be the function µ(B | A) = E(χB | A).

Suppose that α is a countable partition of the set X. By this we mean that α = {A1, A2, . . .}, Ai ∈ B, and

(i) ∪i Ai = X,

(ii) Ai ∩ Aj = ∅ if i ≠ j

(up to sets of measure zero). (More precisely, µ((∪i Ai) △ X) = 0 and µ(Ai ∩ Aj) = 0 if i ≠ j.)

The partition α generates a σ-algebra. By an abuse of notation, we denote this σ-algebra by α. The conditional expectation of an integrable function f with respect to the partition α is easily seen to be

    E(f | α)(x) = Σ_{A∈α} χA(x) (∫_A f dµ) / µ(A).

Finally, we will need the following useful result.
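The displayed formula says that on each cell A of the partition, E(f | α) is the µ-average of f over A. A minimal sketch on a six-point space with uniform measure (the partition and function are arbitrary choices):

```python
# E(f | alpha) for a finite partition: constant on each cell, equal to the
# mu-average of f over that cell.
X = range(6)
mu = {x: 1 / 6 for x in X}                 # uniform measure on {0,...,5}
alpha = [{0, 1}, {2, 3, 4}, {5}]           # a partition of X
f = {0: 1.0, 1: 3.0, 2: 0.0, 3: 6.0, 4: 0.0, 5: 4.0}

def cond_exp(x):
    A = next(cell for cell in alpha if x in cell)    # the cell containing x
    return sum(f[y] * mu[y] for y in A) / sum(mu[y] for y in A)

print([cond_exp(x) for x in X])
# constant on cells; integrating E(f | alpha) against mu recovers
# the integral of f, as property (ii) of conditional expectation demands
```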

Theorem 10.4 (Increasing martingale theorem). Let A1 ⊂ A2 ⊂ · · · ⊂ An ⊂ · · · be an increasing sequence of σ-algebras such that An ↑ A (i.e. ∪n An generates A). Then

(i) E(f | An) → E(f | A) a.e., and

(ii) E(f | An) → E(f | A) in L^1, i.e.

    ∫ |E(f | An) − E(f | A)| dµ → 0

as n → ∞.

Proof. Omitted.

10.3 Information and Entropy

We begin with some motivation. Suppose we are trying to locate a point x in a probability space (X, B, µ). To do this we use a (countable) partition α = {A1, A2, . . .}, Aj ∈ B. By this we mean that

(i) ∪i Ai = X,

(ii) Ai ∩ Aj = ∅ if i ≠ j

(up to sets of measure zero). (More precisely, µ((∪i Ai) △ X) = 0 and µ(Ai ∩ Aj) = 0 if i ≠ j.)

If we find that x ∈ Ai then we have received some information. We want to define a function

    I(α) : X → R^+


such that I(α)(x) is the amount of information we receive on learning that x ∈ A_i. We would like this to depend only on the size of A_i, i.e. µ(A_i), and to be large when µ(A_i) is small and small when µ(A_i) is large. Thus we require I(α) to have the form

I(α)(x) = ∑_{A∈α} χ_A(x) φ(µ(A))    (6)

for some function φ : [0, 1] → ℝ⁺, as yet unspecified. Let α = {A_1, A_2, A_3, ...} and β = {B_1, B_2, B_3, ...} be two partitions. Define the join α ∨ β of α and β to be the partition

α ∨ β = {A ∩ B | A ∈ α, B ∈ β}.

We say that two partitions α, β are independent if µ(A ∩ B) = µ(A)µ(B), whenever A ∈ α, B ∈ β. It is then natural to require that if α and β are two independent partitions then

I(α ∨ β) = I(α) + I(β). (7)

That is, if α and β are independent then the amount of information we obtain by knowing which element of α ∨ β we are in is equal to the amount of information we obtain by knowing which element of α we are in together with the amount of information we obtain by knowing which element of β we are in. Applying (7) to (6), we see that we have

φ(µ(A ∩ B)) = φ(µ(A)µ(B)) = φ(µ(A)) + φ(µ(B)).

If we also want φ to be continuous, then φ(t) must be (a multiple of) − log t. Throughout, we will use the convention that 0 × log 0 = 0.

Definition 10.5. Given a partition α, we define the information I(α) : X → ℝ⁺ by

I(α)(x) = − ∑_{A∈α} χ_A(x) log µ(A).

We define the entropy H(α) of the partition α to be the average value, i.e.

H(α) = ∫ I(α) dµ = − ∫ ∑_{A∈α} χ_A log µ(A) dµ = − ∑_{A∈α} µ(A) log µ(A).
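Since the entropy depends only on the masses of the cells, it is easy to compute in examples. A small Python sketch (illustrative only; the function name is made up), using the convention 0 log 0 = 0:

```python
import math

# Illustrative sketch (not part of the notes): the entropy of a countable
# partition, H(alpha) = -sum mu(A) log mu(A), computed from the list of
# cell masses, with the convention 0 log 0 = 0.

def partition_entropy(masses):
    return -sum(m * math.log(m) for m in masses if m > 0)

# The partition of [0,1] into [0,1/2) and [1/2,1] under Lebesgue measure:
H = partition_entropy([0.5, 0.5])
assert abs(H - math.log(2)) < 1e-12   # H = log 2
```

For a fixed number of cells the entropy is maximised by the partition with equal masses, consistent with the idea that such a partition is the most "informative" on average.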


10.4 Conditional Information and Entropy

Let A be a sub-σ-algebra of B. We define the conditional information of α given A to be

I(α | A)(x) = − ∑_{A∈α} χ_A(x) log µ(A | A)(x),

where µ(A | A) = E(χ_A | A). Once again, the conditional entropy H(α | A) is the average

H(α | A) = − ∫ ∑_{A∈α} χ_A log µ(A | A) dµ = − ∫ ∑_{A∈α} µ(A | A) log µ(A | A) dµ

(by one of the properties of conditional expectation and the Monotone Convergence Theorem). As a special case, we have that I(α | N) = I(α) and H(α | N) = H(α), where N is the trivial σ-algebra consisting of the sets of measure 0 and 1.

Exercise 10.6. Show that H(α | A) = 0 (or I(α | A) ≡ 0) if and only if α ⊂ A. (In particular, H(α | B) = 0 and I(α | B) ≡ 0.)

10.5 Basic Properties

Recall that if α is a countable partition of a measurable space (X, B) and if C ⊂ B is a sub-σ-algebra then we define the conditional information and conditional entropy of α relative to C to be

I(α | C) = − ∑_{A∈α} χ_A log µ(A | C)

and

H(α | C) = − ∫ ∑_{A∈α} χ_A log µ(A | C) dµ = − ∫ ∑_{A∈α} µ(A | C) log µ(A | C) dµ,

respectively.

Exercise 10.7. Show that if γ is a countable partition of X then

µ(A | γ) = ∑_{C∈γ} χ_C (∫_C χ_A dµ) / µ(C) = ∑_{C∈γ} χ_C µ(A ∩ C) / µ(C).


Lemma 10.8 (The Basic Identities). For three countable partitions α, β, γ we have that

I(α ∨ β | γ) = I(α | γ) + I(β | α ∨ γ), H(α ∨ β | γ) = H(α | γ) + H(β | α ∨ γ).

Proof. We only need to prove the first identity; the second follows by integration. If x ∈ A ∩ B, A ∈ α, B ∈ β, then

I(α ∨ β | γ)(x) = − log µ(A ∩ B | γ)(x)

and

µ(A ∩ B | γ) = ∑_{C∈γ} χ_C µ(A ∩ B ∩ C) / µ(C)

(exercise). Thus, if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, we have

I(α ∨ β | γ)(x) = − log ( µ(A ∩ B ∩ C) / µ(C) ).

On the other hand, if x ∈ A ∩ C, A ∈ α, C ∈ γ, then

I(α | γ)(x) = − log ( µ(A ∩ C) / µ(C) )

and if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, then

I(β | α ∨ γ)(x) = − log ( µ(A ∩ B ∩ C) / µ(A ∩ C) ).

Hence, if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, we have

I(α | γ)(x) + I(β | α ∨ γ)(x) = − log ( µ(A ∩ B ∩ C) / µ(C) ) = I(α ∨ β | γ)(x).
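The first Basic Identity can also be checked numerically on a finite probability space. The following Python sketch (illustrative only; all names are made up) encodes partitions as label maps and verifies the identity pointwise:

```python
import math

# Illustrative sketch (not part of the notes): checking the Basic Identity
# I(alpha v beta | gamma) = I(alpha | gamma) + I(beta | alpha v gamma)
# pointwise on a finite space. A partition is a function point -> label.

def info(mu, part, cond, x):
    """I(part | cond)(x) = -log( mu(A cap C) / mu(C) ), where A and C are
    the cells of part and cond containing x."""
    num = sum(p for y, p in mu.items()
              if part(y) == part(x) and cond(y) == cond(x))
    den = sum(p for y, p in mu.items() if cond(y) == cond(x))
    return -math.log(num / den)

pts = range(8)
mu = {x: (x + 1) / 36 for x in pts}        # an arbitrary probability vector
alpha = lambda x: x % 2
beta = lambda x: x // 4
gamma = lambda x: (x // 2) % 2
join_ab = lambda x: (alpha(x), beta(x))    # alpha v beta
join_ag = lambda x: (alpha(x), gamma(x))   # alpha v gamma

for x in pts:
    lhs = info(mu, join_ab, gamma, x)
    rhs = info(mu, alpha, gamma, x) + info(mu, beta, join_ag, x)
    assert abs(lhs - rhs) < 1e-12
```

The check succeeds for every point and every probability vector, reflecting the fact that the identity is an exact telescoping of logarithms, not an inequality.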

Definition 10.9. Let α and β be countable partitions of X. We say that β is a refinement of α, and write α ≤ β, if every set in α is a union of sets in β.

Exercise 10.10. Show that if α ≤ β then I(α | β) = 0. (This corresponds to an intuitive understanding of how information should behave: if α ≤ β then we receive no information on learning which element of α a point is in, given that we already know which element of β it lies in.)

Corollary 10.11. If γ ≥ β then

I(α ∨ β | γ) = I(α | γ), H(α ∨ β | γ) = H(α | γ).

Proof. If γ ≥ β then β ≤ γ ≤ α ∨ γ and so I(β | α ∨ γ) ≡ 0, H(β | α ∨ γ) = 0. The result now follows from the Basic Identities.


Corollary 10.12. If α ≥ β then

I(α | γ) ≥ I(β | γ), H(α | γ) ≥ H(β | γ).

Proof. If α ≥ β then

I(α | γ) = I(α ∨ β | γ) = I(β | γ) + I(α | β ∨ γ) ≥ I(β | γ).

The same argument works for entropy.

We next need to show the harder result that if γ ≥ β then H(α | β) ≥ H(α | γ). This requires the following inequality.

Proposition 10.13 (Jensen’s Inequality). Let φ : [0, 1] → R+ be continuous and concave (i.e., for 0 ≤ p ≤ 1, φ(px + (1 − p)y) ≥ pφ(x) + (1 − p)φ(y)). Let f : X → [0, 1] be measurable (on (X, B)) and let A be a sub-σ-algebra of B. Then

φ(E(f | A)) ≥ E(φ(f ) | A) µ-a.e.

Proof. Omitted.

As a consequence we obtain:

Lemma 10.14. If γ ≥ β then H(α | β) ≥ H(α | γ).

Remark 10.15. Note!: the corresponding statement for information is not true.

Proof. Set φ(t) = −t log t for 0 < t ≤ 1 and φ(0) = 0; this is continuous and concave on [0, 1]. Pick A ∈ α and define f(x) = µ(A | γ)(x) = E(χ_A | γ)(x). Then, applying Jensen's Inequality with the σ-algebra generated by β playing the role of A, we have

φ(E(f | β)) ≥ E(φ(f ) | β).

Now, by one of the properties of conditional expectation,

E(f | β) = E(E(χA | γ) | β) = E(χA | β) = µ(A | β).

Therefore, we have that

−µ(A | β) log µ(A | β) = φ(µ(A | β)) ≥ E(−µ(A | γ) log µ(A | γ) | β).

Integrating, we can remove the conditional expectation on the right-hand side and obtain

∫ −µ(A | β) log µ(A | β) dµ ≥ ∫ −µ(A | γ) log µ(A | γ) dµ.

Finally, summing over A ∈ α gives H(α | β) ≥ H(α | γ).


10.6 Entropy of a Transformation Relative to a Partition We are now (at last!) in a position to bring measure-preserving transformations back into the picture. We are going to define the entropy of a measure-preserving transformation T relative to a partition α (with H(α) < +∞). Later we shall remove the dependence on α to obtain the genuine entropy. We first need the following standard analytic lemma.

Lemma 10.16. Let (a_n) be a sub-additive sequence of real numbers (i.e. a_{n+m} ≤ a_n + a_m for all n, m). Then the sequence a_n/n converges to its infimum as n → ∞.

Proof. Omitted. (As an exercise, you might want to try to prove this.)
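A quick numerical illustration of the lemma (a sketch, not from the notes): take the concrete subadditive sequence a_n = n log 2 + √n, which is subadditive because √(n+m) ≤ √n + √m. Here a_n/n = log 2 + 1/√n decreases to the infimum log 2; in general a_n/n need not be monotone, only convergent to the infimum.

```python
import math

# Illustration of Lemma 10.16 (Fekete's subadditive lemma), not part of
# the notes: a_n = n*log(2) + sqrt(n) is subadditive, and a_n / n
# = log(2) + 1/sqrt(n) decreases to its infimum log(2).

def a(n):
    return n * math.log(2) + math.sqrt(n)

ratios = [a(n) / n for n in (1, 10, 100, 10000)]
assert all(r1 > r2 for r1, r2 in zip(ratios, ratios[1:]))  # decreasing here
assert abs(ratios[-1] - (math.log(2) + 0.01)) < 1e-12      # close to log 2
```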

Exercise 10.17. Let α be a countable partition of X. Show that T^{-1}α = {T^{-1}A | A ∈ α} is a countable partition of X. Show that H(T^{-1}α) = H(α).

Let us write

H_n(α) = H( ∨_{i=0}^{n-1} T^{-i}α ).

Using the basic identity (with γ equal to the trivial partition) we have that

H_{n+m}(α) = H( ∨_{i=0}^{n+m-1} T^{-i}α )
= H( ∨_{i=0}^{n-1} T^{-i}α ) + H( ∨_{i=n}^{n+m-1} T^{-i}α | ∨_{i=0}^{n-1} T^{-i}α )
≤ H( ∨_{i=0}^{n-1} T^{-i}α ) + H( ∨_{i=n}^{n+m-1} T^{-i}α )
= H( ∨_{i=0}^{n-1} T^{-i}α ) + H( T^{-n} ∨_{i=0}^{m-1} T^{-i}α )
= H_n(α) + H_m(α).

We have just shown that H_n(α) is a sub-additive sequence. Therefore, by Lemma 10.16,

lim_{n→∞} (1/n) H_n(α)

exists and we can make the following definition.

Definition 10.18. We define the entropy of a measure-preserving transformation T relative to a partition α (with H(α) < +∞) to be

h(T, α) = lim_{n→∞} (1/n) H( ∨_{i=0}^{n-1} T^{-i}α ).
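This limit can be watched numerically. For the doubling map T(x) = 2x mod 1 with Lebesgue measure and α = {[0, 1/2), [1/2, 1)}, the join ∨_{i=0}^{n-1} T^{-i}α is the partition into dyadic intervals of length 2^{-n}, so H_n(α)/n = log 2 for every n. A Python sketch (illustrative only; function names are made up):

```python
import math

# Illustrative sketch (not part of the notes): H_n(alpha)/n for the
# doubling map T(x) = 2x mod 1, Lebesgue measure, and the partition
# alpha = {[0,1/2), [1/2,1)}. Cells of the n-fold join are labelled by
# itineraries (binary words), and here are exactly the dyadic intervals
# of length 2^{-n}, so H_n/n = log 2 for every n.

def itinerary_masses(n, samples=4096):
    """Mass of each cell of the n-fold join, found by classifying grid
    midpoints by their itinerary (exact here, since cells are dyadic)."""
    masses = {}
    for k in range(samples):
        y = (k + 0.5) / samples
        word = []
        for _ in range(n):
            word.append(0 if y < 0.5 else 1)
            y = (2 * y) % 1.0            # the doubling map
        word = tuple(word)
        masses[word] = masses.get(word, 0.0) + 1.0 / samples
    return masses

def entropy(masses):
    return -sum(m * math.log(m) for m in masses.values() if m > 0)

for n in (1, 2, 5, 8):
    assert abs(entropy(itinerary_masses(n)) / n - math.log(2)) < 1e-9
```

So h(T, α) = log 2 for this partition, consistent with the later exercise that the entropy of the doubling map is log 2.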


Remark 10.19. Since

H_n(α) ≤ H_{n-1}(α) + H(α) ≤ · · · ≤ nH(α),

we have 0 ≤ h(T, α) ≤ H(α).

Remark 10.20. Here is an alternative formula for h(T, α). Let

α_n = α ∨ T^{-1}α ∨ · · · ∨ T^{-(n-1)}α.

Then

H(α_n) = H(α | T^{-1}α ∨ · · · ∨ T^{-(n-1)}α) + H(T^{-1}α ∨ · · · ∨ T^{-(n-1)}α)
= H(α | T^{-1}α ∨ · · · ∨ T^{-(n-1)}α) + H(α_{n-1}).

Hence

H(α_n)/n = (1/n)[ H(α | T^{-1}α ∨ · · · ∨ T^{-(n-1)}α) + H(α | T^{-1}α ∨ · · · ∨ T^{-(n-2)}α) + · · · + H(α | T^{-1}α) + H(α) ].

Since

H(α | T^{-1}α ∨ · · · ∨ T^{-(n-1)}α) ≤ H(α | T^{-1}α ∨ · · · ∨ T^{-(n-2)}α) ≤ · · · ≤ H(α)

and

H(α | T^{-1}α ∨ · · · ∨ T^{-(n-1)}α) → H( α | ∨_{i=1}^{∞} T^{-i}α )

(by the Increasing Martingale Theorem), we have

h(T, α) = lim_{n→∞} (1/n) H(α_n) = H( α | ∨_{i=1}^{∞} T^{-i}α ).

10.7 The entropy of a measure-preserving transformation Finally, we can define the entropy of T with respect to the measure µ.

Definition 10.21. Let T be a measure-preserving transformation of the probability space (X, B, µ). Then the entropy of T with respect to µ is defined to be

h(T ) = sup{h(T, α) | α is a countable partition such that H(α) < ∞}.


10.8 Entropy as an isomorphism invariant Recall the definition of what it means to say that two measure-preserving transformations are metrically isomorphic.

Definition 10.22. We say that two measure-preserving transformations (X, B, µ, T ) and (Y, C, m, S) are (measure theoretically) isomorphic if there exist M ∈ B and N ∈ C such that

(i) TM ⊂ M, SN ⊂ N,

(ii) µ(M) = 1, m(N) = 1, and there exists a bijection φ : M → N such that

(i) φ, φ−1 are measurable and measure-preserving (i.e. µ(φ−1A) = m(A) for all A ∈ C),

(ii) φ ◦ T = S ◦ φ.

We prove that two metrically isomorphic measure-preserving transformations have the same entropy.

Theorem 10.23. Let T : X → X be a measure-preserving transformation of (X, B, µ) and let S : Y → Y be a measure-preserving transformation of (Y, C, m). If T and S are isomorphic then h(T) = h(S).

Proof. Let M ⊂ X, N ⊂ Y and φ : M → N be as above. If α is a partition of Y then (changing it on a set of measure zero if necessary) it is also a partition of N. The inverse image φ−1α = {φ−1A | A ∈ α} is a partition of M and hence of X. Furthermore,

H_µ(φ^{-1}α) = − ∑_{A∈α} µ(φ^{-1}A) log µ(φ^{-1}A) = − ∑_{A∈α} m(A) log m(A) = H_m(α).

More generally,

H_µ( ∨_{j=0}^{n-1} T^{-j}(φ^{-1}α) ) = H_µ( φ^{-1} ∨_{j=0}^{n-1} S^{-j}α ) = H_m( ∨_{j=0}^{n-1} S^{-j}α ).

Therefore, dividing by n and letting n → ∞, we have

h(S, α) = h(T, φ−1α).


Thus

h(S) = sup{h(S, α) | α a partition of Y, H_m(α) < ∞}
= sup{h(T, φ^{-1}α) | α a partition of Y, H_m(α) < ∞}
≤ sup{h(T, β) | β a partition of X, H_µ(β) < ∞} = h(T).

By symmetry, we also have h(T) ≤ h(S). Therefore h(T) = h(S).

Note that the converse to Theorem 10.23 is false in general: two measure-preserving transformations with the same entropy need not be metrically isomorphic.

10.9 Calculating entropy At first sight, the entropy of a measure-preserving transformation seems hard to calculate as it involves taking a supremum over all possible (finite entropy) partitions. However, some short cuts are possible.

10.10 Generators and Sinai's theorem

A major complication in the definition of entropy is the need to take the supremum over all finite entropy partitions. Sinai's theorem guarantees that h(T) = h(T, α) for a partition α whose refinements generate the full σ-algebra. We begin by proving the following result.

Theorem 10.24 (Abramov's theorem). Suppose that α_1 ≤ α_2 ≤ · · · ↑ B are countable partitions such that H(α_n) < ∞ for all n ≥ 1. Then

h(T) = lim_{n→∞} h(T, α_n).

Proof. Choose any countable partition β such that H(β) < ∞. Fix n > 0. Then

H( ∨_{j=0}^{k-1} T^{-j}β ) ≤ H( ∨_{j=0}^{k-1} T^{-j}β ∨ ∨_{j=0}^{k-1} T^{-j}α_n )
≤ H( ∨_{j=0}^{k-1} T^{-j}α_n ) + H( ∨_{j=0}^{k-1} T^{-j}β | ∨_{j=0}^{k-1} T^{-j}α_n ),

by the basic identity.


Observe that

H( ∨_{j=0}^{k-1} T^{-j}β | ∨_{j=0}^{k-1} T^{-j}α_n )
= H( β | ∨_{j=0}^{k-1} T^{-j}α_n ) + H( ∨_{j=1}^{k-1} T^{-j}β | β ∨ ∨_{j=0}^{k-1} T^{-j}α_n )
≤ H(β | α_n) + H( ∨_{j=1}^{k-1} T^{-j}β | ∨_{j=1}^{k-1} T^{-j}α_n )
= H(β | α_n) + H( ∨_{j=0}^{k-2} T^{-j}β | ∨_{j=0}^{k-2} T^{-j}α_n ).

Continuing this inductively, we see that

H( ∨_{j=0}^{k-1} T^{-j}β | ∨_{j=0}^{k-1} T^{-j}α_n ) ≤ kH(β | α_n).

Hence

h(T, β) = lim_{k→∞} (1/k) H( ∨_{j=0}^{k-1} T^{-j}β ) ≤ lim_{k→∞} (1/k) H( ∨_{j=0}^{k-1} T^{-j}α_n ) + H(β | α_n) = h(T, α_n) + H(β | α_n).

We now prove that H(β | α_n) → 0 as n → ∞. To do this, it is sufficient to prove that I(β | α_n) → 0 in L^1 as n → ∞. Recall that

I(β | α_n)(x) = − ∑_{B∈β} χ_B(x) log µ(B | α_n)(x) = − log µ(B | α_n)(x)

if x ∈ B, B ∈ β. By the Increasing Martingale Theorem, we know that µ(B | α_n)(x) → χ_B(x) a.e. Hence for a.e. x ∈ B,

I(β | α_n)(x) → − log χ_B(x) = 0.

Hence for any countable partition β with H(β) < ∞ we have that h(T, β) ≤ lim_{n→∞} h(T, α_n). The result follows by taking the supremum over all such β.

Definition 10.25. We say that a countable partition α is a generator if T is invertible and

∨_{j=-(n-1)}^{n-1} T^{-j}α → B

as n → ∞. We say that a countable partition α is a strong generator if

∨_{j=0}^{n-1} T^{-j}α → B

as n → ∞.

Remark 10.26. To check whether a partition α is a generator (respectively, a strong generator), it is sufficient to check that it separates almost every pair of points. That is, for almost every x, y ∈ X, there exists n such that x and y lie in different elements of the partition ∨_{j=-(n-1)}^{n-1} T^{-j}α (respectively, ∨_{j=0}^{n-1} T^{-j}α). The following important theorem will be the main tool in calculating entropy.

Theorem 10.27 (Sinai's theorem). Suppose that α is a strong generator, or that T is invertible and α is a generator. If H(α) < ∞ then

h(T) = h(T, α).

Proof. The proofs of the two cases are similar; we prove the case when T is invertible and α is a generator of finite entropy. Let n ≥ 1. Then

h( T, ∨_{j=-n}^{n} T^{-j}α ) = lim_{k→∞} (1/k) H( ∨_{j=-n}^{n+k-1} T^{-j}α )
= lim_{k→∞} (1/k) H( α ∨ · · · ∨ T^{-(2n+k-1)}α )
= h(T, α)

for each n. As α is a generator, we have that

∨_{j=-n}^{n} T^{-j}α → B.

By Abramov's theorem, h(T, α) = h(T).

10.11 Entropy of a power

Observe that if T preserves the measure µ then so does T^k. The following result relates the entropies of T and T^k.

Theorem 10.28. (i) For k ≥ 0 we have that h(T^k) = kh(T).

(ii) If T is invertible then h(T) = h(T^{-1}).


Proof. We prove (i), leaving the case k = 0 as an exercise. Choose a countable partition α with H(α) < ∞. Then

h( T^k, ∨_{j=0}^{k-1} T^{-j}α ) = lim_{n→∞} (1/n) H( ∨_{j=0}^{nk-1} T^{-j}α )
= k lim_{n→∞} (1/nk) H( ∨_{j=0}^{nk-1} T^{-j}α ) = kh(T, α).

Thus,

kh(T) = sup_{H(α)<∞} kh(T, α)
= sup_{H(α)<∞} h( T^k, ∨_{j=0}^{k-1} T^{-j}α )
≤ sup_{H(α)<∞} h(T^k, α) = h(T^k).

On the other hand,

h(T^k, α) = lim_{n→∞} (1/n) H( ∨_{j=0}^{n-1} T^{-jk}α )
≤ lim_{n→∞} (1/n) H( ∨_{j=0}^{nk-1} T^{-j}α )    (by Corollary 10.12)
= k lim_{n→∞} (1/nk) H( ∨_{j=0}^{nk-1} T^{-j}α ) = kh(T, α),

and so h(T^k) ≤ kh(T), completing the proof.

We prove (ii). We have

H( ∨_{j=0}^{n-1} T^{-j}α ) = H( T^{n-1} ∨_{j=0}^{n-1} T^{-j}α ) = H( ∨_{j=0}^{n-1} T^{j}α ).

Therefore

h(T, α) = lim_{n→∞} (1/n) H( ∨_{j=0}^{n-1} T^{-j}α ) = lim_{n→∞} (1/n) H( ∨_{j=0}^{n-1} T^{j}α ) = h(T^{-1}, α).

Taking the supremum over α gives h(T) = h(T^{-1}).

Exercise 10.29. Prove that the entropy of the identity map is zero.


10.12 Calculating entropy using generators In this subsection, we show how generators and Sinai’s theorem can be used to calculate the entropy for some of our examples.

10.13 Subshifts of finite type Let A be an irreducible k × k matrix with entries from {0, 1}. Recall that we define the shifts of finite type to be the spaces

Σ_A = {(x_n)_{n=−∞}^{∞} ∈ {1, . . . , k}^ℤ | A(x_n, x_{n+1}) = 1 for all n ∈ ℤ},
Σ_A⁺ = {(x_n)_{n=0}^{∞} ∈ {1, . . . , k}^ℕ | A(x_n, x_{n+1}) = 1 for all n ∈ ℕ},

and the shift maps σ : Σ_A → Σ_A, σ : Σ_A⁺ → Σ_A⁺ by (σx)_n = x_{n+1}. Let P be a stochastic matrix and let p be a normalised left eigenvector, so that pP = p.

Suppose that P is compatible with A, so that Pi,j > 0 if and only if A(i, j) = 1. Recall that we define the Markov measure µP by defining it on cylinder sets by

µP [z0, z1, . . . , zn] = pz0 Pz0z1 ··· Pzn−1zn ,

and then extending it to the full σ-algebra by using the Kolmogorov Extension Theorem.

We shall calculate h_{µ_P}(σ) for the one-sided shift, which for notational brevity we denote by σ : Σ_A → Σ_A; the calculation for the two-sided shift is similar. Let α be the partition {[1], . . . , [k]} of Σ_A into cylinders of length 1. Then

H(α) = − ∑_{i=1}^{k} µ_P[i] log µ_P[i] = − ∑_{i=1}^{k} p_i log p_i < ∞.

The partition α_n = ∨_{i=0}^{n} σ^{-i}α consists of all allowed cylinders of length n + 1:

∨_{i=0}^{n} σ^{-i}α = {[z_0, z_1, . . . , z_n] | A(z_i, z_{i+1}) = 1, i = 0, . . . , n − 1}.


Hence α is a strong generator. Moreover, we have

H( ∨_{i=0}^{n} σ^{-i}α )
= − ∑_{[z_0,z_1,...,z_n]∈α_n} µ_P[z_0, z_1, . . . , z_n] log µ_P[z_0, z_1, . . . , z_n]
= − ∑_{[z_0,z_1,...,z_n]∈α_n} p_{z_0} P_{z_0z_1} · · · P_{z_{n−1}z_n} log( p_{z_0} P_{z_0z_1} · · · P_{z_{n−1}z_n} )
= − ∑_{i_0=1}^{k} · · · ∑_{i_n=1}^{k} p_{i_0} P_{i_0i_1} · · · P_{i_{n−1}i_n} ( log p_{i_0} + log P_{i_0i_1} + · · · + log P_{i_{n−1}i_n} )
= − ∑_{i_0=1}^{k} p_{i_0} log p_{i_0} − n ∑_{i,j=1}^{k} p_i P_{ij} log P_{ij},

where we have used the identities ∑_{j=1}^{k} P_{ij} = 1 and ∑_{i=1}^{k} p_i P_{ij} = p_j. Therefore

h_{µ_P}(σ) = h_{µ_P}(σ, α) = lim_{n→∞} (1/(n + 1)) H( ∨_{i=0}^{n} σ^{-i}α ) = − ∑_{i,j=1}^{k} p_i P_{ij} log P_{ij}.
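The formula h = −∑ p_i P_{ij} log P_{ij} is easy to evaluate in practice once the stationary vector p is known. A Python sketch (illustrative only; function names are made up) that finds p by iterating the chain, which converges when P is irreducible and aperiodic:

```python
import math

# Illustrative sketch (not part of the notes): the entropy of a Markov
# measure, h = -sum_{i,j} p_i P_ij log P_ij, where p is the stationary
# vector pP = p, found here by iterating the chain (power method).

def stationary(P, iterations=10_000):
    k = len(P)
    p = [1.0 / k] * k
    for _ in range(iterations):
        p = [sum(p[i] * P[i][j] for i in range(k)) for j in range(k)]
    return p

def markov_entropy(P):
    p = stationary(P)
    k = len(P)
    return -sum(p[i] * P[i][j] * math.log(P[i][j])
                for i in range(k) for j in range(k) if P[i][j] > 0)

# Full 2-shift with the (1/2, 1/2) Bernoulli measure: entropy log 2.
assert abs(markov_entropy([[0.5, 0.5], [0.5, 0.5]]) - math.log(2)) < 1e-12
```

For instance, for the golden-mean shift with P = [[1/2, 1/2], [1, 0]] one can check that the stationary vector is (2/3, 1/3) and the formula gives entropy (2/3) log 2.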

Exercise 10.30. Carry out the above calculation for a full shift on k symbols with a Bernoulli measure determined by the probability vector p = (p_1, . . . , p_k) to show that in this case the entropy is − ∑_{i=1}^{k} p_i log p_i.

Recall from Theorem 10.23 that if two measure-preserving transformations are metrically isomorphic then they have the same entropy, but that the converse is not necessarily true. However, for Bernoulli and aperiodic Markov measures on two-sided shifts of finite type, entropy is a complete invariant:

Theorem 10.31 (Ornstein’s theorem). Any two 2-sided Bernoulli shifts with the same entropy are metrically isomorphic.

Theorem 10.32 (Ornstein and Friedman). Any two 2-sided aperiodic Markov shifts with the same entropy are metrically isomorphic.

Remark 10.33. Both of these theorems are false for 1-sided shifts. The isomorphism problem for 1-sided shifts is a very subtle problem.


10.14 The continued fraction map

Recall that the continued fraction map is defined by T(x) = 1/x mod 1 and preserves Gauss' measure µ defined by

µ(B) = (1/log 2) ∫_B 1/(1 + x) dx.

Let A_n = (1/(n + 1), 1/n) and let α be the partition α = {A_n | n = 1, 2, 3, . . .}.

Exercise 10.34. Check that H(α) < ∞. (Hint: use the fact that Gauss' measure µ and Lebesgue measure λ are comparable, i.e. there exist constants c, C > 0 such that c ≤ µ(B)/λ(B) ≤ C for all B ∈ B.)

We claim that α is a strong generator for T. To see this, recall that each irrational x has a distinct continued fraction expansion. Hence α separates irrational, hence almost all, points. For notational convenience, let

[x_0, . . . , x_{n−1}] = A_{x_0} ∩ T^{−1}A_{x_1} ∩ · · · ∩ T^{−(n−1)}A_{x_{n−1}}
= {x ∈ [0, 1] | T^j(x) ∈ A_{x_j} for j = 0, . . . , n − 1},

so that [x_0, . . . , x_{n−1}] is the set of all x ∈ [0, 1] whose continued fraction expansion starts x_0, . . . , x_{n−1}. If x ∈ [x_0, . . . , x_n] then

I(α | T^{−1}α ∨ · · · ∨ T^{−n}α)(x) = − log ( µ([x_0, . . . , x_n]) / µ([x_1, . . . , x_n]) ).

We will use the following fact: if I_n(x) is a nested sequence of intervals such that I_n(x) ↓ {x} as n → ∞ then

lim_{n→∞} (1/λ(I_n(x))) ∫_{I_n(x)} f(y) dy = f(x),

where λ denotes Lebesgue measure. We will also need the fact that

lim_{n→∞} λ([x_0, . . . , x_n]) / λ([x_1, . . . , x_n]) = 1/|T'(x)|.

Hence

µ([x_0, . . . , x_n]) / µ([x_1, . . . , x_n])
= ( ∫_{[x_0,...,x_n]} dx/(1+x) ) / ( ∫_{[x_1,...,x_n]} dx/(1+x) )
= ( (∫_{[x_0,...,x_n]} dx/(1+x)) / λ([x_0, . . . , x_n]) ) × ( λ([x_1, . . . , x_n]) / (∫_{[x_1,...,x_n]} dx/(1+x)) ) × ( λ([x_0, . . . , x_n]) / λ([x_1, . . . , x_n]) )
→ (1/(1 + x)) × (1 + Tx) × (1/|T'(x)|).


Hence

I( α | ∨_{j=1}^{∞} T^{−j}α )(x) = − log ( ((1 + Tx)/(1 + x)) × (1/|T'(x)|) ).

Using the fact that µ is T-invariant, we see that

H( α | ∨_{j=1}^{∞} T^{−j}α ) = ∫ I( α | ∨_{j=1}^{∞} T^{−j}α ) dµ = − ∫ log (1/|T'(x)|) dµ = ∫ log |T'(x)| dµ,

since ∫ log(1 + Tx) dµ = ∫ log(1 + x) dµ by T-invariance.

Now T(x) = 1/x mod 1, so that T'(x) = −1/x². Hence

h(T) = H( α | ∨_{j=1}^{∞} T^{−j}α ) = − (2/log 2) ∫_0^1 (log x)/(1 + x) dx,

which evaluates to π²/(6 log 2), since ∫_0^1 −(log x)/(1 + x) dx = π²/12.

Exercise 10.35. Define T : [0, 1] → [0, 1] to be the doubling map T(x) = 2x mod 1. Let µ denote Lebesgue measure. We know that µ is a T-invariant probability measure. Prove that h(T) = log 2.

Exercise 10.36. Define T : [0, 1] → [0, 1] by T(x) = 4x(1 − x). Define the measure µ by

µ(B) = (1/π) ∫_B 1/√(x(1 − x)) dx.

We have seen in a previous exercise that µ is an invariant probability measure. Show that h(T) = log 2. (Hint: you may use the fact that the partition α = {[0, 1/2], [1/2, 1]} is a strong generator.)

Exercise 10.37. Let β > 1 be the golden mean, so that β² = β + 1. Define T(x) = βx mod 1. Define the density

k(x) = 1/(1/β + 1/β³)        on [0, 1/β),
k(x) = (1/β)/(1/β + 1/β³)    on [1/β, 1),

and define the measure

µ(B) = ∫_B k(x) dx.

In a previous exercise, we saw that µ is T-invariant. Assuming that α = {[0, 1/β), [1/β, 1]} is a strong generator, show that h(T) = log β.


Exercise 10.38. Let T(x) = 1/x mod 1 and let

µ(B) = (1/log 2) ∫_B 1/(1 + x) dx

be Gauss' measure. Let A_k = (1/(k + 1), 1/k]. Explain why α = {A_k}_{k=1}^{∞} is a strong generator for T. Show that the entropy of T with respect to µ can be written as

h(T) = − (1/log 2) ∫_0^1 (log x²)/(1 + x) dx.
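This integral has a known closed form: expanding 1/(1 + x) as a geometric series and using ∫_0^1 −(log x) x^n dx = 1/(n + 1)² gives ∫_0^1 −(log x)/(1 + x) dx = ∑ (−1)^n/(n + 1)² = π²/12, so h(T) = π²/(6 log 2) ≈ 2.373. A numerical check (an illustrative Python sketch, not part of the notes):

```python
import math

# Illustrative check (not part of the notes): the entropy of the Gauss map,
# h(T) = -(2/log 2) * int_0^1 log(x)/(1+x) dx. Term-by-term integration of
# the geometric series for 1/(1+x) gives the alternating sum below, whose
# value is pi^2/12, so h(T) = pi^2 / (6 log 2).

integral = sum((-1) ** n / (n + 1) ** 2 for n in range(200_000))
assert abs(integral - math.pi ** 2 / 12) < 1e-9   # alternating series bound

h = (2 / math.log(2)) * integral
assert abs(h - math.pi ** 2 / (6 * math.log(2))) < 1e-8
```

The partial-sum error of an alternating series is bounded by its first omitted term, here about 2.5 × 10⁻¹¹, which justifies the tolerance used.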
