Chapter 2

Basic Markov Chain Theory

To repeat what we said in Chapter 1, a Markov chain is a discrete-time stochastic process $X_1, X_2, \ldots$ taking values in an arbitrary state space that has the Markov property and stationary transition probabilities:

• the conditional distribution of $X_n$ given $X_1, \ldots, X_{n-1}$ is the same as the conditional distribution of $X_n$ given $X_{n-1}$ only, and

• the conditional distribution of $X_n$ given $X_{n-1}$ does not depend on $n$.

The conditional distribution of $X_n$ given $X_{n-1}$ specifies the transition probabilities of the chain. In order to completely specify the probability law of the chain, we must also specify the initial distribution, the distribution of $X_1$.

2.1 Transition Probabilities

2.1.1 Discrete State Space

For a discrete state space $S$, the transition probabilities are specified by defining a matrix

    P(x, y) = \Pr(X_n = y \mid X_{n-1} = x), \qquad x, y \in S    (2.1)

that gives the probability of moving from the point $x$ at time $n - 1$ to the point $y$ at time $n$. Because of the assumption of stationary transition probabilities, the transition probability matrix $P(x, y)$ does not depend on the time $n$.

Some readers may object that we have not defined a "matrix." A matrix (I can hear such readers saying) is a rectangular array $P$ of numbers $p_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, called the entries of $P$. Where is $P$? Well, enumerate the points in the state space $S = \{x_1, \ldots, x_d\}$; then

    p_{ij} = \Pr(X_n = x_j \mid X_{n-1} = x_i), \qquad i = 1, \ldots, d, \; j = 1, \ldots, d.

I hope I can convince you this view of "matrix" is the Wrong Thing. There are two reasons.


First, the enumeration of the state space does no work. It is an irrelevancy that just makes for messier notation. The mathematically elegant definition of a matrix does not require that the index sets be $\{1, \ldots, m\}$ and $\{1, \ldots, n\}$ for some integers $m$ and $n$. Any two finite sets will do as well. In this view, a matrix is a function on the Cartesian product of two finite sets. And in this view, the function $P$ defined by (2.1), which is a function on $S \times S$, is a matrix.

Following the usual notation of set theory, the space of all real-valued functions on a set $A$ is written $\mathbb{R}^A$. This is, of course, a $d$-dimensional vector space when $A$ has $d$ points. Those who prefer to write $\mathbb{R}^d$ instead of $\mathbb{R}^A$ may do so, but the notation $\mathbb{R}^A$ is more elegant and corresponds to our notion of $A$ being the index set rather than $\{1, \ldots, d\}$. So our matrices $P$, being functions on $S \times S$, are elements of the $d^2$-dimensional vector space $\mathbb{R}^{S \times S}$.

The second reason is that $P$ is a conditional probability mass function. In most contexts, (2.1) would be written $p(y \mid x)$. For a variety of reasons, partly the influence of the matrix analogy, we write $P(x, y)$ instead of $p(y \mid x)$ in Markov chain theory. This is a bit confusing at first, but one gets used to it. It would be much harder to see the connection if we were to write $p_{ij}$ instead of $P(x, y)$.

Thus, in general, we define a transition probability matrix to be a real-valued function $P$ on $S \times S$ satisfying

    P(x, y) \geq 0, \qquad x, y \in S    (2.2a)

and

    \sum_{y \in S} P(x, y) = 1.    (2.2b)

The state space $S$ must be countable for the definition to make sense. When $S$ is not finite, we have an infinite matrix. Any matrix that satisfies (2.2a) and (2.2b) is said to be Markov or stochastic.

Example 2.1 (Random Walk with Reflecting Boundaries). Consider the symmetric random walk on the integers $1, \ldots, d$ with "reflecting boundaries." This means that at each step the chain moves one unit up or down with equal probabilities, $\frac{1}{2}$ each way, except at the end points. At 1, the lower end, the chain still moves up to 2 with probability $\frac{1}{2}$, but cannot move down, there being no points below to move to. Here when it wants to go down, which it does with probability $\frac{1}{2}$, it bounces off an imaginary reflecting barrier back to where it was. The behavior at the upper end is analogous. This gives the transition matrix

    P = \begin{pmatrix}
    \frac{1}{2} & \frac{1}{2} & 0 & 0 & \cdots & 0 & 0 & 0 \\
    \frac{1}{2} & 0 & \frac{1}{2} & 0 & \cdots & 0 & 0 & 0 \\
    0 & \frac{1}{2} & 0 & \frac{1}{2} & \cdots & 0 & 0 & 0 \\
    0 & 0 & \frac{1}{2} & 0 & \cdots & 0 & 0 & 0 \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
    0 & 0 & 0 & 0 & \cdots & 0 & \frac{1}{2} & 0 \\
    0 & 0 & 0 & 0 & \cdots & \frac{1}{2} & 0 & \frac{1}{2} \\
    0 & 0 & 0 & 0 & \cdots & 0 & \frac{1}{2} & \frac{1}{2}
    \end{pmatrix}    (2.3)

We could instead use functional notation

    P(x, y) = \begin{cases} 1/2, & |x - y| = 1 \text{ or } x = y = 1 \text{ or } x = y = d \\ 0, & \text{otherwise} \end{cases}

Either works. We will use whichever is most convenient.
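For readers who want to see the example in the computer, here is a small sketch (in Python with NumPy, neither of which the text otherwise assumes; the function name is our own invention) that builds the matrix (2.3) from the functional form of $P(x, y)$ and checks the Markov properties (2.2a) and (2.2b).

```python
import numpy as np

def reflecting_walk_matrix(d):
    """Transition matrix (2.3) for the symmetric random walk on
    {1, ..., d} with reflecting boundaries, built from the
    functional form of P(x, y)."""
    P = np.zeros((d, d))
    for x in range(1, d + 1):
        for y in range(1, d + 1):
            if abs(x - y) == 1 or x == y == 1 or x == y == d:
                P[x - 1, y - 1] = 0.5
    return P

P = reflecting_walk_matrix(5)
assert (P >= 0).all()                    # condition (2.2a)
assert np.allclose(P.sum(axis=1), 1.0)   # condition (2.2b)
print(P)
```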

2.1.2 General State Space

For a general state space $S$ the transition probabilities are specified by defining a kernel

    P(x, B) = \Pr(X_n \in B \mid X_{n-1} = x), \qquad x \in S, \; B \text{ a measurable set in } S,

satisfying

• for each fixed $x$ the function $B \mapsto P(x, B)$ is a probability measure, and

• for each fixed $B$ the function $x \mapsto P(x, B)$ is a measurable function.

In other words, the kernel is a regular conditional probability (Breiman 1968, Section 4.3). Lest the reader worry that this definition signals an impending blizzard of measure theory, let me assure you that it does not. A little bit of measure theory is unavoidable in treating this subject, if only because the major reference works on Markov chains, such as Meyn and Tweedie (1993), are written at that level. But in practice measure theory is entirely dispensable in MCMC, because the computer has no sets of measure zero or other measure-theoretic paraphernalia. So if a Markov chain really exhibits measure-theoretic pathology, it can’t be a good model for what the computer is doing. In any case, we haven’t hit serious measure theory yet. The main reason for introducing kernels here is purely notational. It makes unnecessary a lot of useless discussion of special cases. It allows us to write expressions like

    E\{g(X_n) \mid X_{n-1} = x\} = \int P(x, dy)\, g(y)    (2.4)

using one notation for all cases. Avoiding measure-theoretic notation leads to excruciating contortions. Sometimes the distribution of $X_n$ given $X_{n-1}$ is a continuous distribution on $\mathbb{R}^d$ with density $f(y \mid x)$. Then the kernel is defined by

    P(x, B) = \int_B f(y \mid x)\, dy

and (2.4) becomes

    E\{g(X_n) \mid X_{n-1} = x\} = \int g(y) f(y \mid x)\, dy.

Readers who like boldface for "vectors" can supply the appropriate boldface. Since both $x$ and $y$ here are elements of $\mathbb{R}^d$, every variable is boldfaced. I don't like the "vectors are boldface" convention. It is just one more bit of distinguishing trivial special cases that makes it much harder to see what is common to all cases.

Often the distribution of $X_n$ given $X_{n-1}$ is more complicated. A common situation in MCMC is that the distribution is continuous except for an atom at $x$. The chain stays at $x$ with probability $r(x)$ and moves with probability $1 - r(x)$, and when it moves the distribution is given by a density $f(y \mid x)$. Then (2.4) becomes

    E\{g(X_n) \mid X_{n-1} = x\} = r(x) g(x) + [1 - r(x)] \int g(y) f(y \mid x)\, dy.

The definition of the kernel in this case is something of a mess:

    P(x, B) = \begin{cases} r(x) + [1 - r(x)] \int_B f(y \mid x)\, dy, & x \in B \\ [1 - r(x)] \int_B f(y \mid x)\, dy, & \text{otherwise} \end{cases}    (2.5)

This can be simplified by introducing the identity kernel (yet more measure-theoretic notation) defined by

    I(x, B) = \begin{cases} 1, & x \in B \\ 0, & x \notin B \end{cases}    (2.6)

which allows us to rewrite (2.5) as

    P(x, B) = r(x) I(x, B) + [1 - r(x)] \int_B f(y \mid x)\, dy.

We will see why the identity kernel has that name a bit later.

Another very common case in MCMC has the distribution of $X_n$ given $X_{n-1}$ changing only one component of the state vector, say the $i$-th. The Gibbs update discussed in Chapter 1 is an example. The distribution of the $i$-th component has a density $f(y \mid x)$, but now $x$ is an element of $\mathbb{R}^d$ and $y$ is an element of $\mathbb{R}$ (not $\mathbb{R}^d$). Then (2.4) becomes

    E\{g(X_n) \mid X_{n-1} = x\} = \int g(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_d) f(y \mid x)\, dy.

The notation for the kernel is even uglier unless we use "probability is a special case of expectation." To obtain the kernel, just take the special case where $g$ is the indicator function of the set $B$.

The virtue of the measure-theoretic notation (2.4) is that it allows us to refer to all of these special cases and many more without getting bogged down in a lot of details that are irrelevant to the point under discussion. I have often wondered why this measure-theoretic notation isn't introduced in lower level courses. It would avoid tedious repetition, where first we woof about the discrete case, then the continuous case, even rarely the mixed case, thus obscuring what is common to all the cases. One can use the notation without knowing anything about measure-theoretic probability. Just take (2.4) as the definition of the notation. If you understand what expectations mean in the model at hand, then you can write out what the notation means in each case, as we have done above. Regardless of whether you think this would be a good idea in lower level courses or not, I hope you are convinced that the notation is necessary in dealing with Markov chains. One would never see the forest for the trees without it.
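As a concrete illustration of how (2.4) specializes, the following sketch (Python/NumPy again; the holding probability $r(x)$, the density $f(y \mid x)$, and the test function $g$ are made-up choices, not anything from the text) approximates $E\{g(X_n) \mid X_{n-1} = x\}$ for the kernel (2.5) with an atom at $x$ by ordinary Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up ingredients: holding probability r(x), a normal density
# f(y | x) to move by, and a test function g.
r = lambda x: 0.3
g = lambda y: y ** 2
sample_f = lambda x, size: rng.normal(loc=x, scale=1.0, size=size)

def Pg(x, nsim=100_000):
    """Monte Carlo approximation of E{g(X_n) | X_{n-1} = x} for the
    kernel (2.5): stay at x with probability r(x), otherwise move
    according to the density f(. | x)."""
    y = sample_f(x, nsim)
    return r(x) * g(x) + (1 - r(x)) * g(y).mean()

# With f(. | x) = Normal(x, 1) and g(y) = y^2 the exact value is
# r(x) * x^2 + (1 - r(x)) * (x^2 + 1), so we can check the answer.
x = 1.5
print(Pg(x), r(x) * x**2 + (1 - r(x)) * (x**2 + 1))
```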

2.1.3 Existence of Infinite Random Sequences

Transition probabilities do not by themselves define the probability law of the Markov chain, though they do define the law conditional on the initial position, that is, given the value of $X_1$. In order to specify the unconditional law of the Markov chain we need to specify the initial distribution of the chain, which is the marginal distribution of $X_1$. If $\lambda$ is the initial distribution and $P$ is the transition kernel and $g_1, \ldots, g_n$ are any real-valued functions, then

    E\{g_1(X_1) \cdots g_n(X_n)\} = \int \cdots \int \lambda(dx_1) P(x_1, dx_2) \cdots P(x_{n-1}, dx_n)\, g_1(x_1) \cdots g_n(x_n),

provided the expectation exists. This determines the joint probability distribution of $X_1, \ldots, X_n$ for any $n$. Just take the special case where the $g_i$ are indicator functions.

Let $Q_n$ denote the probability distribution of $X_1, \ldots, X_n$, a measure on the Cartesian product $S^n$, where $S$ is the state space. The $Q_n$ are called the finite-dimensional distributions of the infinite random sequence $X_1, X_2, \ldots$. The finite-dimensional distributions satisfy the obvious consistency property: $Q_n(A) = Q_{n+1}(A \times S)$. It is a theorem of measure-theoretic probability (Fristedt and Gray 1997, Theorem 3 of Chapter 22 and Definition 10 of Chapter 21) that for any consistent sequence of finite-dimensional distributions, there exists a unique $Q_\infty$ for the infinite sequence such that $Q_\infty$ agrees with the finite-dimensional distributions, that is, if $A$ is a measurable set in $S^n$ and

    B = \{ (x_1, x_2, \ldots) \in S^\infty : (x_1, \ldots, x_n) \in A \},

then $Q_n(A) = Q_\infty(B)$.

We will only rarely refer explicitly or even implicitly to $Q_\infty$. One place where it cannot be avoided is the strong law of large numbers, which says that the set of infinite sequences $(X_1, X_2, \ldots)$ having the property that $\overline{X}_n \to \mu$ has probability one, the probability here referring to $Q_\infty$, since it refers to probabilities on the space of infinite sequences. But mostly we deal only with

finite-dimensional distributions. The CLT, for example, is a statement about finite-dimensional distributions only. Anyway, this issue of $Q_\infty$ has nothing to do particularly with Markov chains. It is needed for the SLLN in the i.i.d. case too. If you are not bothered by the SLLN for i.i.d. random sequences, then the SLLN for Markov chains should not bother you either. The measure-theoretic technicalities are exactly the same in both cases.

2.2 Transition Probabilities as Operators

When the state space is finite, we have seen that the transition probabilities form a matrix, a $d \times d$ matrix if the state space has $d$ points. From linear algebra, the reader should be familiar with the notion that a matrix represents a linear operator. This is true for Markov transition matrices as well. Actually, we will see that it represents two different linear operators.

In the general state space case, transition probabilities also represent linear operators, but now the vector spaces on which they operate are infinite-dimensional. We do not assume the reader is familiar with these notions, and so we develop what we need of this theory to work with Markov chains.

2.2.1 Finite State Space

Right Multiplication

When the state space $S$ is finite, (2.4) becomes

    E\{g(X_n) \mid X_{n-1} = x\} = \sum_{y \in S} P(x, y)\, g(y).

Although the notation is unusual, the right hand side corresponds to multiplication of the matrix $P$ on the right by the "column vector" $g$. Using this notation we write the function defined by the right hand side as $Pg$. Hence we have

    Pg(x) = E\{g(X_n) \mid X_{n-1} = x\}.

If we were fussy, we might write the left hand side as $(Pg)(x)$, but the extra parentheses are unnecessary, since the other interpretation of $Pg(x)$, that $P$ operates on the real number $g(x)$, is undefined.

As mentioned above, the vector space of all real-valued functions on $S$ is denoted $\mathbb{R}^S$. The operation of right multiplication defined above takes a function $g$ in $\mathbb{R}^S$ to another function $Pg$ in $\mathbb{R}^S$. This map $R_P : g \mapsto Pg$ is a linear operator on $\mathbb{R}^S$ represented by the matrix $P$. When we are fussy, we distinguish between the matrix $P$ and the linear operator $R_P$ it represents, as is common in introductory linear algebra books (Lang 1987, Chapter IV). But none of the Markov chain literature bothers with this distinction. So we will bother with making this distinction only for a little while. Later we will just write $P$ instead of $R_P$ as all the experts do, relying on context to make it clear whether $P$ means a matrix or a linear operator. We don't want the reader to think that making a clear distinction between the matrix $P$ and the linear operator $R_P$ is essential. Holding fast to that notational idiosyncrasy would just make it hard for you to read the literature.

Left Multiplication

A probability distribution on $S$ also determines a vector in $\mathbb{R}^S$. In this case the vector is the probability mass function $\lambda(x)$. If $X_{n-1}$ has the distribution $\lambda$, then the distribution of $X_n$ is given by

    \Pr(X_n = y) = \sum_{x \in S} \lambda(x) P(x, y).    (2.7)

Again we can recognize a matrix multiplication, this time of the matrix $P$ on the left by the "row vector" $\lambda$. Using this notation we write the probability distribution defined by the right hand side as $\lambda P$, and hence have

    \lambda P(y) = \Pr(X_n = y), \quad \text{when } X_{n-1} \text{ has the distribution } \lambda.

Again, if we were fussy, we might write the left hand side as $(\lambda P)(y)$, but again the extra parentheses are unnecessary, since the other interpretation of $\lambda P(y)$, that $P(y)$ operates on $\lambda$, is undefined because $P(y)$ is undefined.

Equation (2.7) makes sense when $\lambda$ is an arbitrary element of $\mathbb{R}^S$, in which case we say it represents a signed measure rather than a probability measure. Thus the matrix $P$ also represents another linear operator on $\mathbb{R}^S$, the operator $L_P : \lambda \mapsto \lambda P$. Note that $L_P$ and $R_P$ are not the same operator, because $P$ is not a symmetric matrix, so right and left multiplication produce different results.

When we are not being pedantic, we will usually write $P$ instead of $L_P$ or $R_P$. So how do we tell these two operators apart? In most contexts only one of the two is being used, so there is no problem. In contexts where both are in use, the notational distinction between $Pf$ and $\lambda P$ helps distinguish them.
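In code, the two multiplications are just the two sides of a matrix-vector product. A minimal sketch (Python/NumPy; the matrix is (2.3) with $d = 5$, and $g$ and $\lambda$ are arbitrary made-up choices):

```python
import numpy as np

# Transition matrix (2.3) for d = 5 (rows indexed by x, columns by y,
# with states 1, ..., d stored as indices 0, ..., d - 1).
d = 5
P = np.zeros((d, d))
for x in range(d):
    for y in range(d):
        if abs(x - y) == 1 or x == y == 0 or x == y == d - 1:
            P[x, y] = 0.5

g = np.arange(1.0, d + 1)    # a function g on S
lam = np.full(d, 1.0 / d)    # a distribution lambda on S

Pg = P @ g      # right multiplication: Pg(x) = E{g(X_n) | X_{n-1} = x}
lamP = lam @ P  # left multiplication: the distribution of X_n
                # when X_{n-1} ~ lambda

print(Pg)
print(lamP, lamP.sum())      # lambda P is again a probability vector
```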

Invariant Distributions

Recall from Section 1.5 that a probability distribution $\pi$ is an invariant distribution for a specified transition probability matrix $P$ if the Markov chain that results from using $\pi$ as the initial distribution is stationary. (An invariant distribution is also called a stationary or an equilibrium distribution.) Because the transition probabilities are assumed stationary, as we always do, it is enough to check that $X_{n-1} \sim \pi$ implies $X_n \sim \pi$. But we have just learned that $X_{n-1} \sim \lambda$ implies $X_n \sim \lambda P$. Hence we can use our new notation to write the characterization of invariant distributions very simply: a probability distribution $\pi$ is invariant for a transition probability matrix $P$ if and only if $\pi = \pi P$.

Recall from Section 1.7 that the "first task in MCMC" is to find a Markov update mechanism that preserves a specified distribution. Now we can state that in notation. We are given a distribution $\pi$. The "first task" is to find one transition probability matrix $P$ such that $\pi = \pi P$. Often, we want to find several such matrices or kernels, intending to combine them by composition or mixing.
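Numerically, $\pi = \pi P$ says $\pi$ is a left eigenvector of $P$ with eigenvalue 1, so for a small state space one can find an invariant distribution with an eigendecomposition. A sketch (Python/NumPy; the example matrix is made up for illustration, and this brute-force route is not how we will find invariant distributions in MCMC practice):

```python
import numpy as np

def invariant_distribution(P):
    """Find a probability vector pi with pi = pi P, i.e. a left
    eigenvector of P for the eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)   # left eigenvectors of P
    k = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()

# A made-up 3-state Markov matrix.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

pi = invariant_distribution(P)
print(pi)
print(np.allclose(pi @ P, pi))   # pi = pi P
```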

Matrix Multiplication (Composition of Operators)

The distribution of $X_{n+2}$ given $X_n$ is given by

    \Pr(X_{n+2} = z \mid X_n = x) = \sum_{y \in S} P(x, y) P(y, z).

Now we recognize a matrix multiplication. The right hand side is the $(x, z)$ entry of the matrix $P^2$, which we write $P^2(x, z)$. Carrying the process further we see that

    \Pr(X_{n+k} = z \mid X_n = x) = P^k(x, z),

where $P^k(x, z)$ denotes the $(x, z)$ entry of the matrix $P^k$.

We can use these operations together. $P^k g$ is the conditional expectation of $g(X_{n+k})$ given $X_n$, and $\lambda P^k$ is the marginal distribution of $X_{n+k}$ when $X_n$ has marginal distribution $\lambda$.

We also want to use this operation when the transition probability matrices are different. Say $P(x, y)$ and $Q(x, y)$ are two transition probability matrices; their product is defined in the obvious way:

    (PQ)(x, z) = \sum_{y \in S} P(x, y) Q(y, z).

We met this object in Chapter 1 under the name of the composition of $P$ and $Q$, which we wrote as $PQ$, anticipating that it would turn out to be a matrix multiplication. The reason for calling it "composition" is that it is functional composition when we think of $P$ and $Q$ as linear operators. Obviously, $(PQ)g = P(Qg)$. This translates to

    R_{PQ} = R_P \circ R_Q    (2.8a)

when we use the notation $R_P$ for the linear operator $f \mapsto Pf$. It translates to

    L_{PQ} = L_Q \circ L_P    (2.8b)

when we use the notation $L_P$ for the linear operator $\lambda \mapsto \lambda P$. In both cases matrix multiplication represents functional composition, but note that $P$ and $Q$ appear in opposite orders on the right hand sides of (2.8a) and (2.8b), the reason being the difference between right and left multiplication.
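A quick numeric check of (2.8a), (2.8b), and the identification of $k$-step transition probabilities with matrix powers (Python/NumPy; both Markov matrices are made up):

```python
import numpy as np

# Two made-up Markov matrices on a 3-point state space.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
Q = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

g = np.array([1.0, -1.0, 2.0])    # a function on the state space
lam = np.array([0.2, 0.3, 0.5])   # a distribution on the state space

# (PQ)g = P(Qg): right multiplication composes as R_P o R_Q   (2.8a)
print(np.allclose((P @ Q) @ g, P @ (Q @ g)))
# lam(PQ) = (lam P)Q: left multiplication composes as L_Q o L_P  (2.8b)
print(np.allclose(lam @ (P @ Q), (lam @ P) @ Q))

# k-step transition probabilities are the entries of the power P^k.
k = 3
print(np.linalg.matrix_power(P, k))
```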

Convex Combinations of Matrices (Mixing)

Besides multiplication of matrices, linear algebra also defines the operations of matrix addition and multiplication of a matrix by a scalar. Neither of these operations turns a Markov matrix into a Markov matrix, because matrix addition loses property (2.2b) and multiplication by a negative scalar loses property (2.2a).

If we use both operations together, we can get an operation that preserves Markovness. Transition probability matrices are elements of the vector space $\mathbb{R}^{S \times S}$, a $d^2$-dimensional vector space if the state space $S$ has $d$ elements. Addition of matrices is just vector addition in this vector space. Multiplication of a matrix by a scalar is just scalar multiplication in this vector space. If $P_1, \ldots, P_k$ are elements of any vector space, and $a_1, \ldots, a_k$ are scalars, then

    P = a_1 P_1 + \cdots + a_k P_k    (2.9)

is called a linear combination of the $P_i$. If the $a_i$ also satisfy $\sum_i a_i = 1$, a linear combination is called an affine combination. If the $a_i$ also satisfy $a_i \geq 0$ for each $i$, an affine combination is called a convex combination. For Markov matrices $P_1, \ldots, P_k$,

• if $P$ in (2.9) is Markov, then the linear combination is affine, and

• conversely, if the linear combination is convex, then $P$ is Markov

(Exercise 2.2).

Convex combinations correspond exactly to the operation of mixing of update mechanisms (also called "random scan") described in Section 1.7. If there are $k$ update mechanisms, the $i$-th mechanism described by transition probability matrix $P_i$, and we choose to execute the $i$-th mechanism with probability $a_i$, then the transition probability matrix for the combined update mechanism is given by (2.9). In order to be probabilities the $a_i$ must be nonnegative and sum to one, which is exactly the same as the requirement for (2.9) to be a convex combination. We would have called this notion "convex combination" rather than "mixture," but that seemed too long for everyday use.
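Mixing is equally easy to see in code. A sketch (Python/NumPy; the two update mechanisms are made up): the convex combination (2.9) is again Markov, and simulating "choose mechanism $i$ with probability $a_i$, then apply it" is exactly a draw from the mixture kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up Markov matrices (update mechanisms) on 3 states.
P1 = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.5, 0.5]])
P2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.5, 0.5]])
a = np.array([0.3, 0.7])          # mixing probabilities

P = a[0] * P1 + a[1] * P2         # convex combination (2.9)
assert (P >= 0).all() and np.allclose(P.sum(axis=1), 1.0)  # Markov

def random_scan_step(x):
    """One step of the mixture chain: pick mechanism i with
    probability a[i], then move from state x according to P_i."""
    Pi = P1 if rng.random() < a[0] else P2
    return rng.choice(3, p=Pi[x])

print(P)
print(random_scan_step(0))
```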

2.2.2 General State Space

Now we turn to general state spaces, where kernels replace matrices. The objects on which the kernels operate on the left and right are now very different: a function on the state space (an object for right multiplication) is not at all like a measure on the state space (an object for left multiplication).

Signed Measures

In the discrete case we wanted to talk about measures that were not probability measures. We need a similar notion for general state spaces. A real-valued measure on a measurable space¹ $(S, \mathcal{B})$ is a function $\mu : \mathcal{B} \to \mathbb{R}$ that is countably additive.

¹ A measurable space is a pair $(S, \mathcal{B})$ consisting of a set $S$, in this case the state space, and a $\sigma$-field of subsets of $S$. The elements of $\mathcal{B}$ are called the measurable sets or, when we are talking about probabilities, events. So $\mathcal{B}$ is just the set of all possible events.

Although not part of the definition, it is a theorem of real analysis that $\mu$ is actually a bounded function (Rudin 1987, Theorem 6.4), that is, there are constants $a$ and $b$ such that $a \leq \mu(B) \leq b$ for all $B \in \mathcal{B}$. If $\mu(B) \geq 0$ for all measurable sets $B$, then we say $\mu$ is a positive measure. The general case, in which $\mu(B)$ takes values of both signs, is sometimes called a real signed measure, although strictly speaking the "signed" is redundant.

Another theorem (Rudin 1987, Theorem 6.14) says that there exists a partition² of the state space into two measurable sets $A_1$ and $A_2$ such that

    \mu(B) \leq 0, \qquad B \subset A_1
    \mu(B) \geq 0, \qquad B \subset A_2

This is called the Hahn decomposition of the state space $S$.

² Partition means $A_1 \cap A_2 = \emptyset$ and $A_1 \cup A_2 = S$.

Then the measures $\mu^+$ and $\mu^-$ defined by

    \mu^-(B) = -\mu(B \cap A_1), \qquad B \in \mathcal{B}
    \mu^+(B) = \mu(B \cap A_2), \qquad B \in \mathcal{B}

are both positive measures on $S$, and they are mutually singular. Note that $\mu = \mu^+ - \mu^-$, which is called the Jordan decomposition of $\mu$. It is entirely analogous to the decomposition $f = f^+ - f^-$ of a function into its positive and negative parts. The measure $|\mu| = \mu^+ + \mu^-$ is called the total variation of $\mu$, and $\|\mu\| = |\mu|(S)$ is called the total variation norm of $\mu$.

Let $\mathcal{M}(S)$ denote the set of all real signed measures on $S$. From the Jordan decomposition, we see that every element of $\mathcal{M}(S)$ is a difference of positive finite measures, hence a linear combination of probability measures. Thus $\mathcal{M}(S)$ is the vector space spanned by the probability measures. Hence it is the proper replacement for $\mathbb{R}^S$ in our discussion of left multiplication in the discrete case.
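For a finite state space, where a signed measure is just a vector of point masses, the Hahn and Jordan decompositions and the total variation norm can be computed directly. A sketch (Python/NumPy; the signed measure is made up):

```python
import numpy as np

# A made-up signed measure on a 5-point state space,
# represented by its vector of point masses.
mu = np.array([0.3, -0.1, 0.4, -0.25, 0.15])

# Hahn decomposition: A2 carries the positive mass, A1 the negative.
A2 = mu >= 0
A1 = ~A2

# Jordan decomposition mu = mu_plus - mu_minus.
mu_plus = np.where(A2, mu, 0.0)
mu_minus = np.where(A1, -mu, 0.0)

total_variation = mu_plus + mu_minus   # the measure |mu|
tv_norm = total_variation.sum()        # ||mu|| = |mu|(S)

assert np.allclose(mu, mu_plus - mu_minus)
print(tv_norm)
```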

Norms and Operator Norm

For any vector space $V$, a function $x \mapsto \|x\|$ from $V$ to $[0, \infty)$ is called a norm on $V$ if it satisfies the following axioms (Rudin 1987, p. 95):

(a) $\|x + y\| \leq \|x\| + \|y\|$ for all $x, y \in V$,

(b) $\|ax\| = |a| \cdot \|x\|$ for all $a \in \mathbb{R}$ and $x \in V$, and

(c) $\|x\| = 0$ implies $x = 0$.

Axiom (a) is called the triangle inequality. The pair $(V, \|\cdot\|)$ is called a normed vector space or a normed linear space. Total variation norm makes $\mathcal{M}(S)$ a normed vector space. We do need to verify that total variation norm does satisfy the axioms for a norm (Exercise 2.3).

Denote the set of all linear operators on a vector space $V$ by $L(V)$. Then $L(V)$ is itself a vector space if we define vector addition by

    (S + T)(x) = S(x) + T(x), \qquad S, T \in L(V), \; x \in V    (2.10a)

and scalar multiplication by

    (aT)(x) = aT(x), \qquad a \in \mathbb{R}, \; T \in L(V), \; x \in V.    (2.10b)

These definitions are the obvious ones, arrived at almost without thinking. How else would you define the sum of two functions $S$ and $T$ except as the sum (2.10a)? When $V$ is normed, there is a natural corresponding norm for $L(V)$ defined by

    \|T\| = \sup_{\substack{x \in V \\ x \neq 0}} \frac{\|Tx\|}{\|x\|}    (2.11)

Or, more precisely, we should say that (2.11) defines a norm for the subset of $L(V)$ consisting of $T$ such that (2.11) is finite. We denote that subset $B(V)$, and call its elements the bounded operators on $V$. The bounded operators are the well-behaved ones.

A normed linear space is also a metric space, the metric being defined by $d(x, y) = \|x - y\|$. Hence we can discuss topological notions like continuity and convergence of sequences. A sequence $\{x_n\}$ in $V$ converges to a point $x$ if $\|x_n - x\| \to 0$. An operator $T \in L(V)$ is continuous at a point $x$ if $Tx_n \to Tx$ (meaning $\|Tx_n - Tx\| \to 0$) for every sequence $\{x_n\}$ converging to $x$. Since $Tx_n - Tx = T(x_n - x)$ by linearity, a linear operator $T$ is continuous at $x$ if and only if it is continuous at zero. Thus linear operators are either everywhere continuous or nowhere continuous. A linear operator $T$ is continuous if and only if it is bounded (Rudin 1991, Theorem 1.32). Thus the unbounded operators are nowhere continuous, a fairly obnoxious property. If $V$ is finite-dimensional, then every operator in $L(V)$ is bounded (Halmos 1958, p. 177). But if $V$ is infinite-dimensional, there are lots of unbounded operators.

Let's check that operator norm satisfies the norm axioms. Essentially it satisfies the axioms because vector norm does. For the triangle inequality,

    \|S + T\| = \sup_{x \neq 0} \frac{\|Sx + Tx\|}{\|x\|}
    \leq \sup_{x \neq 0} \frac{\|Sx\| + \|Tx\|}{\|x\|}
    \leq \sup_{x \neq 0} \frac{\|Sx\|}{\|x\|} + \sup_{y \neq 0} \frac{\|Ty\|}{\|y\|}
    = \|S\| + \|T\|

The first inequality is the triangle inequality for the vector norm. The second inequality is subadditivity of the supremum operation: for any functions $f$ and $g$ on any set $S$,

    f(x) + g(x) \leq f(x) + \sup_{y \in S} g(y),

so taking the sup over $x$ gives

    \sup_{x \in S}\, [f(x) + g(x)] \leq \sup_{x \in S} f(x) + \sup_{y \in S} g(y).

For axiom (b),

    \|aT\| = \sup_{x \neq 0} \frac{\|aTx\|}{\|x\|} = \sup_{x \neq 0} \frac{|a| \cdot \|Tx\|}{\|x\|} = |a| \cdot \|T\|.

Finally, for axiom (c), $\|T\| = 0$ only if $\|Tx\| = 0$ for all $x \in V$, but axiom (c) for vector norm implies $\|Tx\| = 0$ if and only if $Tx = 0$. Thus $\|T\| = 0$ implies that $T$ is the operator that maps every $x$ to 0. And this operator is indeed the zero of the vector space $L(V)$, because then

    (S + T)(x) = S(x) + T(x) = S(x) + 0 = S(x), \qquad x \in V,

so $S + T = S$ for all $S \in L(V)$, and this is the property that makes $T$ the zero of the vector space $L(V)$.

Operator norm satisfies two important inequalities. The first,

    \|Tx\| \leq \|T\| \cdot \|x\|,    (2.12)

follows immediately from the definition (2.11).

The second involves the notion of operator "multiplication," which is defined as composition of functions: $ST$ is shorthand for $S \circ T$. As we saw above, this agrees with our usual notation in the finite-dimensional case: matrix multiplication corresponds to functional composition of the corresponding operators. With this notion of multiplication, $B(V)$ becomes an operator algebra. A vector algebra, also called a linear algebra, is a vector space in which a multiplication is defined. The reason the subject "linear algebra" is so called is that matrices form a vector algebra. The second important inequality is

    \|ST\| \leq \|S\| \cdot \|T\|.    (2.13)

I call (2.13) the Banach algebra inequality because it is one of the defining properties of a Banach algebra. Since we will have no need of Banach algebras in this course, it is a really horrible name. Maybe we should call it the mumble mumble inequality. Whatever we call it, the proof is a trivial consequence of operator “multiplication” actually being functional composition.

    \|ST\| = \sup_{x \neq 0} \frac{\|S(Tx)\|}{\|x\|} \leq \sup_{x \neq 0} \frac{\|S\| \cdot \|Tx\|}{\|x\|} = \|S\| \cdot \|T\|,

where the inequality is just (2.12).

Left Multiplication

If $\lambda$ is a probability measure on the state space, and $X_{n-1}$ has distribution $\lambda$, then the distribution of $X_n$ is given by

    \lambda P(A) = \int \lambda(dx)\, P(x, A).    (2.14)

This is no longer a matrix multiplication, but it does define a linear operator, because integration is a linear operation. Using the Jordan decomposition, we see that (2.14) makes sense for any $\lambda \in \mathcal{M}(S)$. Hence (2.14) defines a linear operator on $\mathcal{M}(S)$. The next question to answer is whether it is a well-behaved operator, that is, whether it is bounded. In fact, it is. For any Markov kernel $P$, let $L_P$ denote the linear operator on $\mathcal{M}(S)$ defined by $\lambda \mapsto \lambda P$. Then $\|L_P\| = 1$ (Exercise 2.5).

As was the case for discrete state spaces, a probability measure $\pi$ is invariant for a transition probability kernel $P$ if and only if $\pi = \pi P$. This is an integral equation

    \pi(B) = \int \pi(dx)\, P(x, B), \qquad B \in \mathcal{B},

but we do not usually attempt to find a $P$ that satisfies this equation by direct means. Usually we exploit some trick (if this is mysterious, it will all become clear in the next chapter).

Function Spaces

Before we can define the analog of right matrix multiplication, we must decide what space the linear operator $f \mapsto Pf$ is to act upon. There are a number of possibilities. The ones we will consider are the so-called $L^p(\pi)$ spaces, where $1 \leq p \leq \infty$ and $\pi$ is a probability measure. The $L^p(\pi)$ norm of a real-valued measurable function $f$ on the probability space $(S, \mathcal{B}, \pi)$ is defined by

    \|f\|_p = \left( \int |f(x)|^p\, \pi(dx) \right)^{1/p}

when $1 \leq p < \infty$. The vector space $L^p(\pi)$ is the set of all measurable functions $f$ on $(S, \mathcal{B})$ such that $\|f\|_p < \infty$. It is easy to see that the $L^p(\pi)$ norm satisfies axiom (b) for norms. That it satisfies axiom (a) is a well-known inequality called Minkowski's inequality (Rudin 1987, Theorem 3.5). It is also easy to see that the $L^p(\pi)$ norm fails to satisfy axiom (c), since $\|f\|_p = 0$ only implies $\pi\{|f(X)| > 0\} = 0$. If $S$ is not discrete, there must be nonempty sets of probability zero, and any function $f$ that is zero except on a set of probability zero has $\|f\|_p = 0$. In order to make $L^p(\pi)$ a normed vector space, we need to work around this problem by redefining equality in $L^p(\pi)$ to mean equal except on a set of probability zero. Then axiom (c) is satisfied too, and $L^p(\pi)$ is a legitimate normed vector space.

We redefine what we mean by inequalities as well. The statement $f \leq g$ only means $f(x) \leq g(x)$ except on a set of probability zero, and similarly for the other inequality relations.

The space $L^\infty(\pi)$ consists of the bounded elements of $L^p(\pi)$, that is, those $f$ with $|f| \leq c$ for some real number $c$. Following the conventions for $L^p$ spaces, this only means $|f(x)| \leq c$ except on a set of probability zero. The $L^\infty(\pi)$ norm is the smallest $c$ that will work:

    \|f\|_\infty = \inf\{ c > 0 : \pi\{|f(X)| > c\} = 0 \}

This is also now easily seen to satisfy the axioms for norms, axiom (c) holding because we consider $f = 0$ if it is zero except on a set of probability zero. Thus all the $L^p(\pi)$ spaces for $1 \leq p \leq \infty$ are normed vector spaces.³

³ Actually they are Banach spaces, a Banach space being a complete normed vector space, where complete means every Cauchy sequence converges. But that will not play any role in the theory used in this course.

A useful fact about $L^p(\pi)$ spaces is that $1 \leq p \leq q \leq \infty$ implies $L^p(\pi) \supset L^q(\pi)$ (Exercise 2.12). (Warning: this uses the fact that $\pi$ is a bounded measure. It is not true otherwise. However, we will be interested only in the case where $\pi$ is a probability measure.)

Right Multiplication

We are finally ready to define "multiplication" of a kernel on the right by a function. If $f$ is any nonnegative measurable function on $(S, \mathcal{B})$,

    Pf(x) = \int P(x, dy)\, f(y)    (2.15)

is well-defined, though possibly $+\infty$. So we have no trouble defining "right multiplication" for nonnegative functions. General functions are a bit more tricky. The issue is whether we can even define $Pf$ for $f$ that take both positive and negative values. The trouble is that we want $f$ to be integrable with respect to an infinite collection of probability measures, $P(x, \cdot)$, $x \in S$. It turns out that we get everything we need if $\pi$ is an invariant probability measure for a transition probability kernel $P$ and we use integrability with respect to $\pi$ as our criterion. For $f \in L^1(\pi)$, define

    g(x) = \int P(x, dy)\, |f(y)|.

Then

    \int \pi(dx)\, g(x)
    = \iint \pi(dx) P(x, dy)\, |f(y)|
    = \int \pi(dy)\, |f(y)|
    = \|f\|_1    (2.16)

because $\pi = \pi P$. The interchange of the order of integration going from line 2 to line 3 is the conditional Fubini theorem (Fristedt and Gray 1997, Theorem 2 of Chapter 22). Hence the set

    B = \{ x \in S : g(x) < \infty \}

satisfies $\pi(B^c) = 0$, because if $g$ were infinite on a set of positive probability, the integral (2.16) would be infinite. This means we can define $Pf(x)$ by (2.15) for $x \in B$ and arbitrarily (say $Pf(x) = 0$) for $x \in B^c$ and have a function well defined in the $L^p(\pi)$ sense. Since $L^p(\pi) \subset L^1(\pi)$ for any $p > 1$, this makes the map $f \mapsto Pf$ well-defined on $L^p(\pi)$ for $1 \leq p \leq \infty$.

Now we want to show that the linear transformation $R_P : f \mapsto Pf$ actually maps $L^p(\pi)$ into $L^p(\pi)$. For $x \in B$ and $1 \leq p < \infty$, Jensen's inequality gives

    |Pf(x)|^p = \left| \int P(x, dy)\, f(y) \right|^p

    \leq \int P(x, dy)\, |f(y)|^p.

When we integrate both sides with respect to $\pi$, the fact that the left hand side is not defined for $x \in B^c$ does not matter, because $\pi(B^c) = 0$. Hence

    \|Pf\|_p^p = \int \pi(dx)\, |Pf(x)|^p
    \leq \iint \pi(dx) P(x, dy)\, |f(y)|^p
    = \int \pi(dy)\, |f(y)|^p

    = \|f\|_p^p.

Again $\pi = \pi P$ and the conditional Fubini theorem were used in going from line 2 to line 3. The case $p = \infty$ is even simpler: for $x \in B$,

    |Pf(x)| = \left| \int P(x, dy)\, f(y) \right|
    \leq \int P(x, dy)\, |f(y)|
    \leq \|f\|_\infty \int P(x, dy)
    = \|f\|_\infty

Integrating with respect to $\pi$ gives $\|Pf\|_\infty \leq \|f\|_\infty$. Thus we see that for $1 \leq p \leq \infty$ the linear transformation $R_P : f \mapsto Pf$ maps $L^p(\pi)$ into $L^p(\pi)$ and the corresponding operator norm satisfies

    \|R_P\|_p = \sup_{\substack{f \in L^p(\pi) \\ f \neq 0}} \frac{\|R_P f\|_p}{\|f\|_p} \leq 1.    (2.17)

In fact $\|R_P\|_p = 1$, because for $f \equiv 1$,

    Pf(x) = \int P(x, dy) = 1 = f(x),

so $\|Pf\|_p = \|f\|_p$ for constant functions, and the supremum in (2.17) is actually equal to one.

This has been an important section, so we summarize our results. If $f$ is a measurable function from the state space to $[0, \infty]$, then $Pf(x)$ is well defined, though it may have the value $+\infty$. Since the set of functions on which this operation is defined is not a vector space, we cannot call $P$ a linear operator here, but this notion is useful in various places in the theory of Markov chains. If a kernel $P$ has an invariant distribution $\pi$ and $f \in L^p(\pi)$ for some $p \geq 1$, then $Pf$ is a well defined element of $L^p(\pi)$. The linear operator $R_P : f \mapsto Pf$ is a bounded operator on $L^p(\pi)$ having operator norm equal to one.

General Kernels

In discrete state spaces, we wanted to discuss matrices that were not necessarily Markov. We need the analogous definitions for kernels. If $(S, \mathcal{B})$ is a measurable space, then a map $K$ from $S \times \mathcal{B}$ to $\mathbb{R}$ is a kernel if

• for each fixed $x$ the function $B \mapsto K(x, B)$ is a real signed measure, and

• for each fixed $B$ the function $x \mapsto K(x, B)$ is a measurable function.

Multiplication of Kernels

The operation on kernels that is analogous to matrix multiplication is defined by

    (K_1 K_2)(x, A) = \int K_1(x, dy)\, K_2(y, A).

Kernel multiplication is associative:

    (K_1 K_2) K_3 = K_1 (K_2 K_3)    (2.18)

for any kernels $K_1$, $K_2$, and $K_3$, by the conditional Fubini theorem (Fristedt and Gray 1997, Theorem 2 of Chapter 22). Kernel multiplication is not, in general, commutative: $K_1 K_2 = K_2 K_1$ may be false.

All of the results for composition and mixing of transition operators that we described in the discrete case carry over unchanged to the general case. In particular, multiplication of kernels corresponds to composition of operators (also called "fixed scan") in just the same way as we saw in (2.8a) and (2.8b). And a convex combination of Markov operators again produces a Markov operator and still corresponds to the operation of choosing an update mechanism at random and applying it (also called "random scan").

The Identity Kernel

The identity element for any of the kernel operations is indeed the identity kernel defined back in (2.6). The identity kernel has connections with other notations widely used in probability. For fixed $x$, the measure $I(x, \cdot)$ is the probability measure concentrated at $x$, sometimes written $\delta_x$, sometimes called the Dirac measure. For fixed $A$, the function $I(\cdot, A)$ is the indicator of the set $A$, more commonly written $1_A$. The identity kernel is the identity for kernel multiplication because

    (IK)(x, A) = \int I(x, dy)\, K(y, A) = \int \delta_x(dy)\, K(y, A) = K(x, A),

and

    (KI)(x, A) = \int K(x, dy)\, I(y, A) = \int K(x, dy)\, 1_A(y) = \int_A K(x, dy) = K(x, A).

For this reason, we define $K^0 = I$ for any kernel $K$. Then the so-called Chapman-Kolmogorov equation

    K^n = K^m K^{n-m}

holds whenever $0 \leq m \leq n$, as a direct consequence of the associative law (2.18). The identity kernel is the identity for left multiplication of a kernel by a signed measure because

    (\lambda I)(A) = \int \lambda(dx)\, I(x, A) = \int \lambda(dx)\, 1_A(x) = \int_A \lambda(dx) = \lambda(A).

It is the identity for right multiplication of a kernel by a function because

    (If)(x) = \int I(x, dy)\, f(y) = \int \delta_x(dy)\, f(y) = f(x).

Needless to say, the operators $L_P : \lambda \mapsto \lambda P$ and $R_P : f \mapsto Pf$ are the identity operators on the relevant vector spaces when $P$ is the identity kernel.

The identity kernel is Markov because, as we have seen, $I(x, \cdot)$ is a probability measure, $\delta_x$, for each $x$. If $X_n \sim \delta_x$, then $X_{n+1} \sim \delta_x$, because $\delta_x I = \delta_x$. Hence the chain never moves. Thus the identity kernel is the transition probability for the "maximally uninteresting chain" described in Example 1.4.

2.2.3 Hilbert Space Theory

Inner Product Spaces

An inner product on a complex vector space $V$ is a map from $V \times V$ to $\mathbb{C}$, the value for the ordered pair of vectors $x$ and $y$ being written $(x, y)$, that satisfies the following axioms (Halmos 1958, p. 121):

(a) $(x, y) = \overline{(y, x)}$,

(b) $(ax + by, z) = a(x, z) + b(y, z)$, for $a, b \in \mathbb{C}$,

(c) $(x, x) \geq 0$, and

(d) $(x, x) = 0$ implies $x = 0$,

where the overline in (a) denotes complex conjugation. An inner product space is a vector space equipped with an inner product.

For the most part, we will only be interested in real inner product spaces, in which case the complex conjugation in (a) does nothing and the scalars in (b) must be real. Since in applications we have no complex numbers, why should the theory involve them? The answer is eigenvalues and eigenvectors. Transition probability matrices are nonsymmetric and hence may have complex eigenvalues even though all their entries are real. So we will not be able to avoid mentioning complex inner product spaces. However, we will see they play a very minor role in Markov chain theory.

An inner product space is also a normed vector space with the norm defined by $\|x\| = \sqrt{(x, x)}$. It is easily verified that the norm axioms are implied by the inner product axioms (Exercise 2.6), the only bit of the proof that is nontrivial being the triangle inequality, which follows directly from

    |(x, y)| \leq \|x\| \cdot \|y\|,

which is known to statisticians as the Cauchy-Schwarz inequality. It, of course, is proved exactly the same way as one proves that correlations are between $-1$ and $1$.

Hilbert Spaces

A Hilbert space is a complete inner product space, where complete means every Cauchy sequence converges, a sequence $\{x_n\}$ being Cauchy if $\|x_m - x_n\| \to 0$ as $\min(m, n) \to \infty$. We will not develop any of the consequences of this definition, since they are well beyond the level of real analysis taken by most statistics graduate students, but we will steal a few results here and there from Hilbert space theory, explaining what they mean but blithely ignoring proofs.

One important fact about Hilbert space theory is the existence of the adjoint of an operator, which is analogous to the transpose of a matrix. If $T$ is a bounded operator on a Hilbert space $H$, then there is a unique bounded operator $T^*$ on $H$ that satisfies

    (x, Ty) = (T^* x, y), \qquad x, y \in H

(Rudin 1991, Section 12.9). $T^*$ is called the adjoint of $T$. If $T^* = T$, then $T$ is said to be self-adjoint.

To see the connection between adjoints and transposes, equip the vector space $\mathbb{R}^S$ for some finite set $S$ with the usual inner product

    (f, g) = \sum_{x \in S} f(x) g(x).    (2.19)

A linear operator on $\mathbb{R}^S$ is represented by a matrix $M(x, y)$, the linear operator being $T_M : f \mapsto Mf$ (the same as the right multiplication we studied in Section 2.2.1, but with $M$ not necessarily a transition probability matrix). Then

    (f, T_M g) = \sum_{x \in S} \sum_{y \in S} f(x) M(x, y) g(y)

and

    (T_M^* f, g) = \sum_{x \in S} \sum_{y \in S} g(x) M^*(x, y) f(y),

where $M^*$ is the matrix that represents $T_M^*$. Clearly, $M$ and $M^*$ are transposes of each other.

For Markov chain theory, there are only two important Hilbert spaces. The first we have already met: $L^2(\pi)$ is a Hilbert space when the inner product is defined by

    (f, g) = \int f(x) \overline{g(x)}\, \pi(dx).    (2.20)

That this defines an inner product (with the usual proviso that equality means only equality with probability one) is obvious. The completeness comes from the fact that every $L^p(\pi)$ is a complete metric space (Rudin 1987, Theorem 3.11). Usually we consider $L^2(\pi)$ a real Hilbert space, in which case the complex conjugate in (2.20) does nothing.

The reason why $L^2(\pi)$ is so important is that (2.20) is $\mathrm{Cov}\{f(X), g(X)\}$ in the special case when both variables have mean zero. In order to cater to this special case of interest to statisticians, we introduce the subspace of $L^2(\pi)$ that consists of mean-zero functions

    L_0^2(\pi) = \left\{ f \in L^2(\pi) : \int f(x)\, \pi(dx) = 0 \right\}

Another characterization of $L_0^2(\pi)$ uses the notion of orthogonality. Vectors $x$ and $y$ in a Hilbert space are orthogonal if $(x, y) = 0$. If $1$ represents the constant function equal to 1 almost surely, then we can also write

    L_0^2(\pi) = \left\{ f \in L^2(\pi) : (f, 1) = 0 \right\}

Thus $L_0^2(\pi)$ is the subspace of $L^2(\pi)$ orthogonal to the constant functions. Since the linear function $f \mapsto (f, 1)$ is continuous, $L_0^2(\pi)$ is a topologically closed subspace of $L^2(\pi)$ and hence is also a Hilbert space.

Warning: The characterization of the adjoint as the transpose is incorrect for $L^2(\pi)$, even in the finite state space case. The reason is that (2.19) is not the inner product on $L^2(\pi)$; the inner product is defined by (2.20). The same formula applies to finite state spaces as to general state spaces (general includes finite). Exercise 2.9 derives the correct formula for the adjoint.

In the preceding section, we saw that the operator norm for the linear operator $f \mapsto Pf$ is exactly equal to one, no matter which $L^p(\pi)$ we have the operator act on. The Hilbert space $L^2(\pi)$ is no exception, but $L_0^2(\pi)$ is different. Reducing the domain of the operator cannot increase the norm, but may decrease it, the supremum in (2.17) being over a smaller set. The proof that the norm is exactly one no longer applies, because it used the fact that $Pf = f$ for constant functions $f$, and those functions are no longer in the domain. Thus when we consider $R_P : f \mapsto Pf$ an operator on $L_0^2(\pi)$, we have $\|R_P\|_2 \leq 1$ with strict inequality now a possibility.
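For a finite state space this norm can be computed: transport $R_P$ to Euclidean space by the isometry $f \mapsto \mathrm{diag}(\sqrt{\pi})\, f$ and take the largest singular value after projecting out the constant functions. A sketch (Python/NumPy; the matrix is made up, and the linear-algebra route is our own device, not anything from the text):

```python
import numpy as np

# A made-up Markov matrix and its invariant distribution.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()

# Isometry between L^2(pi) and Euclidean space: f -> diag(sqrt(pi)) f.
D = np.diag(np.sqrt(pi))
A = D @ P @ np.linalg.inv(D)      # R_P in Euclidean coordinates

# The constants correspond to the unit vector u = sqrt(pi);
# L^2_0(pi) corresponds to its orthogonal complement.
u = np.sqrt(pi)
Q = np.eye(3) - np.outer(u, u)    # projection onto that complement

norm_L2 = np.linalg.norm(A, 2)           # norm on L^2(pi): exactly 1
norm_L20 = np.linalg.norm(Q @ A @ Q, 2)  # norm on L^2_0(pi): can be < 1
print(norm_L2, norm_L20)
```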

2.2.4 Time-Reversed Markov Chains

The measure-theoretic construction of infinite sequences of random variables discussed in Section 2.1.3 says that specification of the probability distribution of an infinite sequence is equivalent to specifying a consistent set of finite-dimensional distributions. This allows us to specify a stationary Markov chain as a doubly infinite sequence

    \ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots

Specifying the distribution of the doubly infinite sequence is the same as specifying the joint distribution of $X_n, X_{n+1}, \ldots, X_{n+k}$ for any $k > 0$. Stationarity implies that this joint distribution does not depend on $n$.

Two questions naturally arise about the time-reversed sequence. First, is it Markov? Second, what is its kernel? That the time-reversed sequence has the Markov property is a trivial consequence of conditional independence being a symmetric property; that is, the following three statements are equivalent.

• The future is independent of the past given the present.

• The past is independent of the future given the present.

• The past and future are independent given the present.

If this isn’t mathy enough for you, here are some equations. What is to be shown is that

    E\{f(X_{n+1}, X_{n+2}, \ldots)\, g(X_{n-1}, X_{n-2}, \ldots) \mid X_n\}
    = E\{f(X_{n+1}, X_{n+2}, \ldots) \mid X_n\}\, E\{g(X_{n-1}, X_{n-2}, \ldots) \mid X_n\}    (2.21)

for any functions $f$ and $g$ such that both sides are well defined. This says the $\sigma$-field generated by $X_{n+1}, X_{n+2}, \ldots$ (the future) and the $\sigma$-field generated by $X_{n-1}, X_{n-2}, \ldots$ (the past) are conditionally independent given the $\sigma$-field generated by $X_n$ (the present) (Fristedt and Gray 1997, Definition 23 of Chapter 21).

The proof is

    E\{f(X_{n+1}, X_{n+2}, \ldots)\, g(X_{n-1}, X_{n-2}, \ldots) \mid X_n\}
    = E\{E[f(X_{n+1}, X_{n+2}, \ldots)\, g(X_{n-1}, X_{n-2}, \ldots) \mid X_n, X_{n-1}, X_{n-2}, \ldots] \mid X_n\}
    = E\{g(X_{n-1}, X_{n-2}, \ldots)\, E[f(X_{n+1}, X_{n+2}, \ldots) \mid X_n, X_{n-1}, X_{n-2}, \ldots] \mid X_n\}
    = E\{g(X_{n-1}, X_{n-2}, \ldots)\, E[f(X_{n+1}, X_{n+2}, \ldots) \mid X_n] \mid X_n\}
    = E\{f(X_{n+1}, X_{n+2}, \ldots) \mid X_n\}\, E\{g(X_{n-1}, X_{n-2}, \ldots) \mid X_n\}

The equality between lines 3 and 4 is the Markov property of the original chain running forwards in time. The other equalities are standard properties of conditional expectation. The equalities between lines 2 and 3 and between lines 4 and 5 are the property that functions of the conditioning variables can be taken outside a conditional expectation (Fristedt and Gray 1997, Problem 27 of Chapter 23). The equality between lines 1 and 2 is the general iterated conditional expectation formula (Fristedt and Gray 1997, Proposition 6 of Chapter 23). By Propositions 25 and 27 of Chapter 23 in Fristedt and Gray (1997), (2.21) implies the Markov property for the time-reversed chain:

    E\{1_A(X_{n-1}) \mid X_n, X_{n+1}, X_{n+2}, \ldots\} = E\{1_A(X_{n-1}) \mid X_n\}.

Clearly, the time-reversed chain is also stationary; in particular, it has stationary transition probabilities. As to whether these transition probabilities are representable by a kernel, the answer is not necessarily, but usually. The issue is whether there exists a kernel $P^*$ satisfying

    \int_A \pi(dx)\, P^*(x, B) = \int_B \pi(dx)\, P(x, A), \qquad A, B \in \mathcal{B},    (2.22)

(where $\mathcal{B}$ is the $\sigma$-field of the state space), that is, whether $P^*$ exists as a regular conditional probability. Conditional probabilities always exist, but regular ones do not. The key is whether the state space is "nice" enough. If the state space is a so-called Borel space, then regular conditional probabilities (a.k.a. kernels) exist (Fristedt and Gray 1997, Theorem 19 of Chapter 21). Euclidean spaces $\mathbb{R}^d$ are Borel spaces, as are most (all?) other state spaces that arise in practical examples. So we may take it for granted that $P^*$ exists. It is not, however, uniquely defined. $P^*(x, \cdot)$ can be defined arbitrarily for $x$ in a set of $\pi$-probability zero without affecting (2.22). Thus there are many kernels $P^*$, all of which give the same probability law for the time-reversed chain.

Now that we have a kernel $P^*$ for the time-reversed chain, we know that $P^*$ and the marginal distribution $\pi$ of $X_n$, which is invariant for both $P$ and $P^*$, determine the probability distribution of the infinite sequence. We can also look at $P^*$ as an operator. In particular, (2.22) is equivalent to

    \iint \pi(dx) P^*(x, dy)\, f(x) g(y) = \iint \pi(dx) P(x, dy)\, g(x) f(y), \qquad f, g \in L^2(\pi)    (2.23)

by linearity of expectation and monotone convergence. In Hilbert space notation, (2.23) is

    (f, P^* g) = (Pf, g),

so now we see the reason for the choice of the notation $P^*$ for the kernel of the time-reversed chain: it is the adjoint operator on $L^2(\pi)$.
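For a finite state space the time-reversal kernel can be written down explicitly: transcribing (2.22) to point masses gives $\pi(y) P^*(y, x) = \pi(x) P(x, y)$. A sketch (Python/NumPy; the matrix is made up, and since Exercise 2.9 asks for the adjoint formula, treat this as a preview of the finite-state case rather than the text's derivation):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()

# Transition matrix of the time-reversed chain:
# P*(y, x) = pi(x) P(x, y) / pi(y).
P_star = (P * pi[:, None]).T / pi[:, None]

assert np.allclose(P_star.sum(axis=1), 1.0)   # P* is Markov
assert np.allclose(pi @ P_star, pi)           # pi is invariant for P*
# The defining balance: pi(x) P(x, y) = pi(y) P*(y, x) for all x, y.
assert np.allclose(pi[:, None] * P, (pi[:, None] * P_star).T)
print(P_star)
```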

2.2.5 Reversibility

A stationary Markov chain is reversible (also called time-reversible) if the doubly infinite sequence has the same probability distribution when time is reversed. We also say a kernel $P$ is reversible with respect to $\pi$ if (2.22) holds with $P^* = P$, that is,

    \int_A \pi(dx)\, P(x, B) = \int_B \pi(dx)\, P(x, A), \qquad A, B \in \mathcal{B}.    (2.24)

    \int \pi(dx)\, P(x, B) = \int_B \pi(dx) = \pi(B), \qquad B \in \mathcal{B},

which says $\pi P = \pi$. Thus (2.24) implies that $\pi$ is invariant for $P$. This is a very important principle.

If P is reversible with respect to π, then P preserves π.

This will turn out to be our main method for accomplishing the "first task" of MCMC. Given a distribution $\pi$, how do we find Markov update mechanisms that preserve $\pi$? Answer: show they are reversible with respect to $\pi$. If (2.24) holds, then so does (2.23) with $P^* = P$, that is,

    \iint f(x) g(y)\, \pi(dx) P(x, dy) = \iint g(x) f(y)\, \pi(dx) P(x, dy), \qquad f, g \in L^2(\pi).    (2.25)

Hence $P$ is self-adjoint.

$P$ is reversible with respect to $\pi$ if and only if $P$ is a self-adjoint operator on $L^2(\pi)$.

We can rewrite (2.24) as

    \Pr(X_n \in A \;\&\; X_{n+1} \in B) = \Pr(X_n \in B \;\&\; X_{n+1} \in A)    (2.26)

This gives yet another slogan.

A stationary Markov chain is reversible if and only if $X_n$ and $X_{n+1}$ are exchangeable.

For a discrete state space $S$, transition probability matrix $P$, and invariant distribution $\pi$, the reversibility property is

    \Pr(X_n = x \;\&\; X_{n+1} = y) = \Pr(X_n = y \;\&\; X_{n+1} = x),

or, stated in terms of $\pi$ and $P$,

    \pi(x) P(x, y) = \pi(y) P(y, x), \qquad x, y \in S,    (2.27)

a condition that is referred to as detailed balance. Our main tool for establishing that a particular transition probability $P$ has a specified invariant distribution $\pi$ will be verification of the detailed balance condition (2.27) and its counterparts for general state spaces. This is generally much easier than verifying $\pi P = \pi$ directly.

The analogue of (2.27) for general state spaces, (2.26), involves probabilities of sets rather than points, and so does not lead to an analog of the detailed balance condition. You will sometimes see

    \pi(dx) P(x, dy) = \pi(dy) P(y, dx)

called "detailed balance for general state spaces," but strictly speaking this is merely a shorthand for (2.24) or (2.25).
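A sketch of how detailed balance is used in practice (Python/NumPy; the target $\pi$ is made up, and the Metropolis-style construction anticipates the next chapter): build a chain for which (2.27) holds by construction, then verify both the balance condition and the invariance it implies.

```python
import numpy as np

# Target distribution on a 3-point state space (made up).
pi = np.array([0.2, 0.3, 0.5])

# A Metropolis chain with a symmetric proposal (move to either other
# state with probability 1/2) is reversible with respect to pi.
d = 3
P = np.zeros((d, d))
for x in range(d):
    for y in range(d):
        if x != y:
            P[x, y] = 0.5 * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()

# Detailed balance (2.27): pi(x) P(x, y) = pi(y) P(y, x) for all x, y.
flow = pi[:, None] * P
assert np.allclose(flow, flow.T)

# Detailed balance implies invariance: pi P = pi.
assert np.allclose(pi @ P, pi)
print(P)
```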

Exercises

2.1. Find an invariant distribution and show that it is unique for

(a) the random walk with reflecting barriers, Example 2.1;

(b) the modification of the random walk with reflecting barriers, so that the first row of the transition probability matrix is $0, 1, 0, \ldots$ and the last row is modified similarly to $\ldots, 0, 1, 0$, the rest of the rows remaining as in (2.3).

2.2.

(a) Show that a linear combination of Markov transition operators is Markov if and only if the linear combination is an affine combination.

(b) Provide a counterexample that shows an affine combination of Markov transition operators that is not a convex combination but is still Markov.

2.3. Show that total variation norm satisfies the norm axioms.

2.4. Show that the map $L_P : \lambda \mapsto \lambda P$ is a linear operator on $\mathcal{M}(S)$ when $P$ is a Markov kernel. There are two things to show: first, that $L_P$ is a linear transformation,

    L_P(a\lambda + b\mu) = a L_P(\lambda) + b L_P(\mu), \qquad a, b \in \mathbb{R}, \; \lambda, \mu \in \mathcal{M}(S),

and second, that $L_P$ maps $\mathcal{M}(S)$ to $\mathcal{M}(S)$ (that is, $\lambda P$ is a countably additive set function).

2.5. Show that the map $L_P : \lambda \mapsto \lambda P$ satisfies $\|L_P\| = 1$ when $P$ is a Markov kernel.

2.6. Show that $\|x\| = \sqrt{(x, x)}$ defines a norm when $(x, y)$ is an inner product. Include a proof of the Cauchy-Schwarz inequality for inner product spaces.

2.7. Show that the stationary scalar-valued AR(1) time series discussed in Examples 1.2 and 1.5 is reversible.

2.8.

(a) Show that the random walk with reflecting barriers of Example 2.1 is reversible.

(b) Show that the modified random walk of Problem 2.1 (b) is reversible.

(c) Show that the “maximally uninteresting chain” having the identity kernel as its kernel is reversible for any invariant distribution π.

2.9. Suppose $P$ is a transition probability matrix on a finite state space $S$ having invariant distribution $\pi$, considered as a vector $\pi \in \mathbb{R}^S$. Find the formula for the adjoint of $R_P : f \mapsto Pf$ considered as an operator on $L^2(\pi)$.

2.10. Find a Markov chain transition probability kernel that is not reversible.

2.11. Show that the Gibbs update described in Section 1.7 is reversible.

2.12. If $\pi$ is a probability measure, show that $1 \leq p \leq q \leq \infty$ implies $L^p(\pi) \supset L^q(\pi)$.