Chapter 2

Basic Markov Chain Theory

To repeat what we said in Chapter 1, a Markov chain is a discrete-time stochastic process X1, X2, ... taking values in an arbitrary state space that has the Markov property and stationary transition probabilities:

• the conditional distribution of Xn given X1, ..., Xn−1 is the same as the conditional distribution of Xn given Xn−1 only, and

• the conditional distribution of Xn given Xn−1 does not depend on n.

The conditional distribution of Xn given Xn−1 specifies the transition probabilities of the chain. In order to completely specify the probability law of the chain, we must also specify the initial distribution, the distribution of X1.

2.1 Transition Probabilities

2.1.1 Discrete State Space

For a discrete state space S, the transition probabilities are specified by defining a matrix

    P(x, y) = Pr(Xn = y | Xn−1 = x),   x, y ∈ S                    (2.1)

that gives the probability of moving from the point x at time n − 1 to the point y at time n. Because of the assumption of stationary transition probabilities, the transition probability matrix P(x, y) does not depend on the time n.

Some readers may object that we have not defined a “matrix.” A matrix (I can hear such readers saying) is a rectangular array P of numbers pij, i = 1, ..., m, j = 1, ..., n, called the entries of P. Where is P? Well, enumerate the points in the state space S = {x1, ..., xd}, then

    pij = Pr{Xn = xj | Xn−1 = xi},   i = 1, ..., d,  j = 1, ..., d.

I hope I can convince you this view of “matrix” is the Wrong Thing. There are two reasons.

First, the enumeration of the state space does no work. It is an irrelevancy that just makes for messier notation. The mathematically elegant definition of a matrix does not require that the index sets be {1, ..., m} and {1, ..., n} for some integers m and n. Any two finite sets will do as well. In this view, a matrix is a function on the Cartesian product of two finite sets. And in this view, the function P defined by (2.1), which is a function on S × S, is a matrix.

Following the usual notation of set theory, the space of all real-valued functions on a set A is written R^A. This is, of course, a d-dimensional vector space when A has d points. Those who prefer to write R^d instead of R^A may do so, but the notation R^A is more elegant and corresponds to our notion of A being the index set rather than {1, ..., d}. So our matrices P, being functions on S × S, are elements of the d²-dimensional vector space R^(S×S).

The second reason is that P is a conditional probability mass function. In most contexts, (2.1) would be written p(y|x). For a variety of reasons, partly the influence of the matrix analogy, we write P(x, y) instead of p(y|x) in Markov chain theory. This is a bit confusing at first, but one gets used to it. It would be much harder to see the connection if we were to write pij instead of P(x, y).

Thus, in general, we define a transition probability matrix to be a real-valued function P on S × S satisfying

    P(x, y) ≥ 0,   x, y ∈ S                                        (2.2a)

and

    ∑_{y∈S} P(x, y) = 1.                                           (2.2b)

The state space S must be countable for the definition to make sense. When S is not finite, we have an infinite matrix. Any matrix that satisfies (2.2a) and (2.2b) is said to be Markov or stochastic.

Example 2.1. Random Walk with Reflecting Boundaries.
Consider the symmetric random walk on the integers 1, ..., d with “reflecting boundaries.” This means that at each step the chain moves one unit up or down with probability 1/2 each way, except at the end points. At 1, the lower end, the chain still moves up to 2 with probability 1/2, but cannot move down, there being no points below to move to. Here, when it wants to go down, which it does with probability 1/2, it bounces off an imaginary reflecting barrier back to where it was. The behavior at the upper end is analogous. This gives a transition matrix

    ⎛ 1/2  1/2   0    0   ···   0    0    0  ⎞
    ⎜ 1/2   0   1/2   0   ···   0    0    0  ⎟
    ⎜  0   1/2   0   1/2  ···   0    0    0  ⎟
    ⎜  0    0   1/2   0   ···   0    0    0  ⎟
    ⎜  ⋮    ⋮    ⋮    ⋮    ⋱    ⋮    ⋮    ⋮  ⎟                     (2.3)
    ⎜  0    0    0    0   ···   0   1/2   0  ⎟
    ⎜  0    0    0    0   ···  1/2   0   1/2 ⎟
    ⎝  0    0    0    0   ···   0   1/2  1/2 ⎠

We could instead use functional notation

    P(x, y) = ⎧ 1/2,  |x − y| = 1 or x = y = 1 or x = y = d
              ⎩ 0,    otherwise.

Either works. We will use whichever is most convenient.
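As a quick sanity check on the example, here is a small sketch in Python (the choice d = 6 is arbitrary, not from the text) that builds the matrix (2.3) directly from the functional definition of P(x, y) and verifies conditions (2.2a) and (2.2b):

```python
# Build the transition matrix (2.3) for the random walk with
# reflecting boundaries, directly from the functional form of P(x, y).
d = 6  # size of the state space; an arbitrary choice for illustration

def P(x, y):
    """P(x, y) = 1/2 if |x - y| = 1 or x = y = 1 or x = y = d, else 0."""
    if abs(x - y) == 1 or (x == y == 1) or (x == y == d):
        return 0.5
    return 0.0

# The state space is S = {1, ..., d}; the "matrix" is just P on S x S.
matrix = [[P(x, y) for y in range(1, d + 1)] for x in range(1, d + 1)]

# Check (2.2a), nonnegative entries, and (2.2b), each row sums to one.
assert all(entry >= 0.0 for row in matrix for entry in row)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in matrix)
```

Note that nothing here required enumerating the state space: P is used directly as a function on S × S, which is exactly the point of view advocated above.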
2.1.2 General State Space

For a general state space S the transition probabilities are specified by defining a kernel

    P(x, B) = Pr{Xn ∈ B | Xn−1 = x},   x ∈ S, B a measurable set in S,

satisfying

• for each fixed x the function B ↦ P(x, B) is a probability measure, and

• for each fixed B the function x ↦ P(x, B) is a measurable function.

In other words, the kernel is a regular conditional probability (Breiman 1968, Section 4.3).

Lest the reader worry that this definition signals an impending blizzard of measure theory, let me assure you that it does not. A little bit of measure theory is unavoidable in treating this subject, if only because the major reference works on Markov chains, such as Meyn and Tweedie (1993), are written at that level. But in practice measure theory is entirely dispensable in MCMC, because the computer has no sets of measure zero or other measure-theoretic paraphernalia. So if a Markov chain really exhibits measure-theoretic pathology, it can't be a good model for what the computer is doing.

In any case, we haven't hit serious measure theory yet. The main reason for introducing kernels here is purely notational. It makes unnecessary a lot of useless discussion of special cases. It allows us to write expressions like

    E{g(Xn) | Xn−1 = x} = ∫ P(x, dy) g(y)                          (2.4)

using one notation for all cases. Avoiding measure-theoretic notation leads to excruciating contortions.

Sometimes the distribution of Xn given Xn−1 is a continuous distribution on R^d with density f(y|x). Then the kernel is defined by

    P(x, B) = ∫_B f(y|x) dy

and (2.4) becomes

    E{g(Xn) | Xn−1 = x} = ∫ g(y) f(y|x) dy.

Readers who like boldface for “vectors” can supply the appropriate boldface. Since both x and y here are elements of R^d, every variable is boldfaced. I don't like the “vectors are boldface” convention. It is just one more bit of distinguishing trivial special cases that makes it much harder to see what is common to all cases.
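To see the continuous case of (2.4) in action, here is a minimal Monte Carlo sketch. The density f(y|x), taken to be Normal(x, 1), and the test function g(y) = y² are arbitrary choices made purely for illustration, not anything from the text:

```python
import random

# Illustrative continuous kernel: f(y | x) is taken to be the
# Normal(x, 1) density. This is an assumption of the sketch.
def draw_next(x):
    return random.gauss(x, 1.0)

def g(y):
    return y * y  # an arbitrary test function

x = 2.0
n = 200_000
# Monte Carlo approximation of E{g(X_n) | X_{n-1} = x},
# that is, of the integral of g(y) f(y|x) dy.
estimate = sum(g(draw_next(x)) for _ in range(n)) / n

# For g(y) = y^2 and f(y|x) = Normal(x, 1), the integral is x^2 + 1.
print(estimate)  # should be close to 5.0
```

The sample average approximates the right-hand side of (2.4); the same few lines would work for any kernel we can sample from, which is part of the appeal of the unified notation.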
Often the distribution of Xn given Xn−1 is more complicated. A common situation in MCMC is that the distribution is continuous except for an atom at x. The chain stays at x with probability r(x) and moves with probability 1 − r(x), and when it moves the distribution is given by a density f(y|x). Then (2.4) becomes

    E{g(Xn) | Xn−1 = x} = r(x) g(x) + [1 − r(x)] ∫ g(y) f(y|x) dy.

The definition of the kernel in this case is something of a mess

    P(x, B) = ⎧ r(x) + [1 − r(x)] ∫_B f(y|x) dy,   x ∈ B
              ⎩ [1 − r(x)] ∫_B f(y|x) dy,          otherwise        (2.5)

This can be simplified by introducing the identity kernel (yet more measure-theoretic notation) defined by

    I(x, B) = ⎧ 1,  x ∈ B
              ⎩ 0,  x ∉ B                                           (2.6)

which allows us to rewrite (2.5) as

    P(x, B) = r(x) I(x, B) + [1 − r(x)] ∫_B f(y|x) dy.

We will see why the identity kernel has that name a bit later. (A small simulation sketch of this stay-or-move kernel appears at the end of the section.)

Another very common case in MCMC has the distribution of Xn given Xn−1 changing only one component of the state vector, say the i-th. The Gibbs update discussed in Chapter 1 is an example. The distribution of the i-th component has a density f(y|x), but now x is an element of R^d and y is an element of R (not R^d). Then (2.4) becomes

    E{g(Xn) | Xn−1 = x} = ∫ g(x1, ..., xi−1, y, xi+1, ..., xd) f(y|x) dy.

The notation for the kernel is even uglier unless we use “probability is a special case of expectation.” To obtain the kernel, just take the special case where g is the indicator function of the set B.

The virtue of the measure-theoretic notation (2.4) is that it allows us to refer to all of these special cases and many more without getting bogged down in a lot of details that are irrelevant to the point under discussion. I have often wondered why this measure-theoretic notation isn't introduced in lower-level courses. It would avoid tedious repetition, where first we woof about the discrete case, then the continuous case, even rarely the mixed case, thus obscuring what is common to all the cases.
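Here is the promised sketch of the stay-or-move kernel behind (2.5): the chain holds at x with probability r(x) and otherwise moves according to a density f(y|x). The particular r and f below are arbitrary stand-ins chosen for illustration, not taken from the text, though this stay-or-move structure is the shape a Metropolis-type update produces:

```python
import random

# Mixed kernel with an atom at x: hold with probability r(x),
# otherwise move according to the density f(y | x). Both r and f
# here are arbitrary choices made for illustration.
def r(x):
    return 0.3  # constant holding probability

def draw_move(x):
    return random.gauss(x, 1.0)  # f(y | x) taken to be Normal(x, 1)

def step(x):
    if random.random() < r(x):
        return x           # the atom: the chain stays at x
    return draw_move(x)    # the continuous part of the kernel

def g(y):
    return y  # an arbitrary test function

x = 1.0
n = 200_000
estimate = sum(g(step(x)) for _ in range(n)) / n

# E{g(X_n) | X_{n-1} = x} = r(x) g(x) + (1 - r(x)) * integral of
# g(y) f(y|x) dy, which is 0.3 * 1 + 0.7 * 1 = 1 for these choices.
print(estimate)  # should be close to 1.0
```

The simulated average matches the mixed-case formula for (2.4): one term from the atom at x, weighted by r(x), and one from the continuous part, weighted by 1 − r(x).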