4- Markov Chains and MCMC
Markov chains

- Let $S = \{1, 2, \dots, N\}$ be a finite set consisting of $N$ states.
- A Markov chain $Y_0, Y_1, Y_2, \dots$ is a sequence of random variables, with $Y_t \in S$ for all times $t \in \mathbb{N}$, that satisfies the Markov property: given the present state, the future and past states are independent,
  $$\Pr(Y_{n+1} = x \mid Y_0 = y_0, \dots, Y_n = y_n) = \Pr(Y_{n+1} = x \mid Y_n = y_n).$$
- A Markov chain is homogeneous if for all $n \ge 1$: $\Pr(Y_{n+1} = j \mid Y_n = i) = \Pr(Y_n = j \mid Y_{n-1} = i)$, i.e., the transition probabilities are time independent.
- There is an $N \times N$ transition matrix $P$ such that for all $i, j, n$ we have $P_{ij} := \Pr(Y_{n+1} = j \mid Y_n = i) \ge 0$ and $\sum_{j=1}^N P_{ij} = 1$.
- Any matrix $P \in \mathbb{R}^{N \times N}$ with $P_{ij} \ge 0$ and $\sum_{j=1}^N P_{ij} = 1$, for $1 \le i, j \le N$, is called a stochastic matrix.

Example

- Let $S = \{1, 2, 3, 4\}$ and
  $$P = \begin{pmatrix} 1/2 & 1/2 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \\ 1/3 & 1/6 & 1/6 & 1/3 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (1)$$
- [Figure: graphical representation of $P$, drawing only the edges with $P_{ij} > 0$, each labelled with its transition probability.]
- A numerical sketch of this chain appears at the end of this section.

Markov chains as Dynamical Systems

- A homogeneous Markov chain defines a dynamical system:
- Let $X = \{x \in \mathbb{R}^N : x_i \ge 0,\ \sum_{i=1}^N x_i = 1\}$ be the space of probability vectors over $S$.
- $d(x, y) := \frac{1}{2}\sum_{k=1}^N |x_k - y_k| = \frac{1}{2}\|x - y\|_1$ is a metric on $X$.
- $P : X \to X$ defines a linear map by right multiplication, $P : x \mapsto xP$, i.e., $(xP)_n = \sum_{k=1}^N x_k P_{kn}$. It is well-defined since
  $$\sum_{n=1}^N (xP)_n = \sum_{n=1}^N \sum_{k=1}^N x_k P_{kn} = \sum_{k=1}^N x_k \sum_{n=1}^N P_{kn} = \sum_{k=1}^N x_k = 1.$$
- The $n$-step transition matrix is simply $P^{(n)} = P^n$.
- If $x \in X$, then the probability vector over $S$ evolves as the orbit of $x$: $x, xP, xP^2, \dots, xP^n, \dots$
- We are interested in the long-term behaviour of orbits.

Communicating Classes

- A state $j$ is accessible from $i$, denoted $i \to j$, if there exists $n \ge 0$ with $P^{(n)}_{ij} > 0$, i.e., $j$ can be reached from $i$ after a finite number of steps.
- Two states $i$ and $j$ communicate, denoted $i \leftrightarrow j$, if $i \to j$ and $j \to i$.
- Note that for any state $i$, we have $i \leftrightarrow i$. Why?
- Communication is an equivalence relation, which induces communicating classes.
- By analysing the transition matrix, we can determine the communicating classes (a computational route is sketched at the end of this section).
- Find the communicating classes of $P$ in Equation (1).
- A Markov chain is irreducible if it has a single communicating class, i.e., if any state can be reached from any state.

Aperiodic Markov Chains

- The period of a state $i$ is defined as $d(i) := \gcd\{n \ge 1 : P^{(n)}_{ii} > 0\}$, where $\gcd$ denotes the greatest common divisor.
- Example: consider a Markov chain with transition matrix
  $$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 1 & 0 \end{pmatrix} \qquad (2)$$
  All diagonal elements of $P^{2n}$ are positive and those of $P^{2n+1}$ are all zero. Thus, every state has period 2.
- If $d(i) = 1$, then $i$ is said to be aperiodic.
- If all states are aperiodic, then the chain is called aperiodic.

Irreducibility and Periodicity: basic properties

- If a Markov chain is irreducible, then all its states have the same period.
- An irreducible Markov chain is aperiodic if there is a state $i$ with $P_{ii} > 0$.
- An irreducible Markov chain is aperiodic iff there exists $n \ge 1$ such that $P^{(n)}_{ij} > 0$ for all $i, j$.
- If $P$ is the transition matrix of an irreducible Markov chain and $0 < a < 1$, then $aI + (1 - a)P$ is the transition matrix of an irreducible Markov chain, where $I$ is the $N \times N$ identity matrix.
- Thus, by choosing a small $a > 0$, we can add a small self-transition probability to make the Markov chain aperiodic.
- Moreover, we will see later that $P$ and $aI + (1 - a)P$ have the same stationary distribution.
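A minimal numerical sketch of the example above, assuming Python with NumPy (the notes themselves contain no code): encode $P$ from Equation (1), check that it is a stochastic matrix, and follow the orbit $x, xP, xP^2, \dots$ of a probability vector started on state 3.

```python
import numpy as np

# Transition matrix P from Equation (1).
P = np.array([
    [1/2, 1/2, 0,   0  ],
    [1/2, 1/2, 0,   0  ],
    [1/3, 1/6, 1/6, 1/3],
    [0,   0,   0,   1  ],
])

# A stochastic matrix has non-negative entries and rows summing to 1.
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)

# Start from a probability vector x and iterate the map x -> xP.
x = np.array([0.0, 0.0, 1.0, 0.0])   # all mass on state 3
for n in range(1, 11):
    x = x @ P                        # (xP)_n = sum_k x_k P_kn
    print(f"xP^{n} = {np.round(x, 4)}")
```

Running it shows the probability mass leaving state 3 and settling on states 1, 2 and 4.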
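A second sketch under the same assumptions (Python with NumPy; the helper names are my own): compute the communicating classes from the reachability relation, estimate the period $d(i)$ as the gcd of the return times found in the first few powers of $P$ (a truncated approximation of the definition), and check that the lazy chain $aI + (1 - a)P$ becomes aperiodic, using the matrix of Equation (2).

```python
import numpy as np
from math import gcd
from functools import reduce

def reachability(P, tol=1e-12):
    """Boolean matrix R with R[i, j] True iff j is accessible from i (i -> j)."""
    N = len(P)
    R = np.eye(N, dtype=bool) | (P > tol)
    for _ in range(N):   # transitive closure by repeatedly composing the relation
        R = R | ((R.astype(int) @ R.astype(int)) > 0)
    return R

def communicating_classes(P):
    """Partition the states into communicating classes (i <-> j)."""
    R = reachability(P)
    comm = R & R.T                     # i <-> j iff i -> j and j -> i
    classes, seen = [], set()
    for i in range(len(P)):
        if i not in seen:
            cls = set(np.nonzero(comm[i])[0].tolist())
            classes.append(sorted(cls))
            seen |= cls
    return classes

def period(P, i, n_max=50, tol=1e-12):
    """gcd of {n <= n_max : (P^n)_ii > 0}; an approximation since n is truncated."""
    times, Q = [], np.eye(len(P))
    for n in range(1, n_max + 1):
        Q = Q @ P
        if Q[i, i] > tol:
            times.append(n)
    return reduce(gcd, times) if times else 0

# Periodic chain from Equation (2): irreducible, every state has period 2.
P2 = np.array([
    [0,   1,   0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   1/2, 0,   1/2],
    [0,   0,   1,   0  ],
])
print(communicating_classes(P2))                  # a single class: irreducible
print([period(P2, i) for i in range(4)])          # [2, 2, 2, 2]

# A small self-transition makes the chain aperiodic without losing irreducibility.
a = 0.1
P2_lazy = a * np.eye(4) + (1 - a) * P2
print([period(P2_lazy, i) for i in range(4)])     # [1, 1, 1, 1]
```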
Stationary distribution

- A probability vector $\pi \in X$ is a stationary distribution of a Markov chain if $\pi P = \pi$.
- Any transition matrix has an eigenvalue equal to 1. Why?
- Thus, we find the stationary distributions by solving the eigenvalue problem $\pi P = \pi$, or equivalently $P^T \pi^T = \pi^T$.
- Fundamental Theorem of Markov chains: an irreducible and aperiodic Markov chain has a unique stationary distribution $\pi$, which satisfies:
  - $\lim_{n \to \infty} xP^n = \pi$ for all $x \in X$;
  - $\lim_{n \to \infty} P^n$ exists and is the matrix with all rows equal to $\pi$.
- This means that whatever our starting probability vector, the dynamics of the chain takes it to the unique stationary distribution, the steady state of the system.
- Thus, we have an attractor $\pi$ with basin $X$.
- Check: $\pi P = \pi \iff \pi(aI + (1 - a)P) = \pi$, for $0 < a < 1$.

Reversible Chains and the Detailed Balance Condition

- A Markov chain is reversible if there is $\pi \in X$ that satisfies the detailed balance condition: $\pi_i P_{ij} = \pi_j P_{ji}$ for $1 \le i, j \le N$.
- This means that $\Pr(Y_n = j, Y_{n-1} = i) = \Pr(Y_n = i, Y_{n-1} = j)$ when $\Pr(Y_{n-1} = i) = \pi_i$ for all $i$, i.e., the chain is time reversible.
- Exercise: if $\pi$ satisfies the detailed balance condition, then it is a stationary distribution.
- This shows that satisfying the detailed balance condition is sufficient for $\pi$ to be the stationary distribution (a computational check appears at the end of this part).
- However, the detailed balance condition is not a necessary condition for $\pi$ to be a stationary distribution.

Markov Chain Monte Carlo

- Markov Chain Monte Carlo (MCMC) methods are based on the convergence of the orbit of any initial probability vector to the unique stationary distribution of an irreducible and aperiodic Markov chain.
- If we need to sample from a distribution $q$, say on a finite state space, we construct an irreducible and aperiodic Markov chain $P$ with unique stationary distribution $\pi = q$.
- Then, since $\lim_{n \to \infty} xP^n = \pi$, i.e., $\lim_{n \to \infty} |xP^n - \pi| = 0$ for any $x \in X$, it follows that for large $n$ we have $xP^n \approx \pi = q$.
- [Figure: the orbit $x, xP, xP^2, xP^3, xP^4, \dots, xP^n$ converging to $q$.]
- Metropolis-Hastings algorithms, such as the Gibbs sampling used in stochastic Hopfield networks and RBMs, are examples of MCMC, as we will see.

Stochastic Hopfield Networks

- Replace the deterministic update $x_i \leftarrow \mathrm{sgn}(h_i)$, where $h_i = \sum_{j=1}^N w_{ij} x_j$, with an asynchronous probabilistic rule:
  $$\Pr(x_i = \pm 1) = \frac{1}{1 + \exp(\mp 2h_i/T)} = \frac{1}{1 + \exp(-2h_i x_i/T)} \qquad (3)$$
  where $T > 0$ is the pseudo temperature.
- As $T \to 0$, this reduces to the deterministic rule.
- [Figure: the logistic sigmoid $1/(1 + e^{-x/T})$; as $T \to 0$ it approaches a step function (the deterministic sgn rule), and as $T \to \infty$ it flattens to the constant 1/2.]
- Exercise: from (3), obtain the probability of flipping:
  $$\Pr(x_i \to -x_i) = \frac{1}{1 + \exp(\Delta E/T)}, \qquad (4)$$
  where $\Delta E = E' - E$ is the change in energy.
- There is now some probability that the energy increases.

Stationary Distribution

- A stochastic Hopfield network can be viewed as a Markov process with $2^N$ states: $x_i = \pm 1$ for $1 \le i \le N$.
- The flipping probability (4) defines the transition matrix $P$.
- Irreducible: the flipping rule is applied asynchronously at all nodes, one at a time, and therefore there is a non-zero probability of going from any state to any other state.
- Aperiodic: there is a non-zero probability that at any time a state does not change.
- Thus, the network has a unique stationary distribution.
- Exercise: show that the distribution
  $$\pi(x) := \Pr(x) = \frac{\exp(-E(x)/T)}{Z}, \qquad (5)$$
  where $Z$ is the normalisation factor, satisfies the detailed balance condition, i.e., it is the stationary distribution.
- Start with any configuration of the network and successively apply the flipping rule; after a large number of iterations, we have a sample from the stationary distribution.
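Returning to the finite-chain picture, here is a short sketch of the eigenvalue route to the stationary distribution, the detailed balance check, and the fact that $P$ and $aI + (1 - a)P$ share the same $\pi$ (Python with NumPy assumed; the 3-state matrix below is an arbitrary irreducible, aperiodic example of mine, not taken from the notes).

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalised to a probability vector."""
    evals, evecs = np.linalg.eig(P.T)          # solve P^T v = lambda v
    k = np.argmin(np.abs(evals - 1.0))         # eigenvalue closest to 1
    pi = np.real(evecs[:, k])
    return pi / pi.sum()

def satisfies_detailed_balance(pi, P, tol=1e-10):
    """Check pi_i P_ij == pi_j P_ji for all i, j."""
    B = pi[:, None] * P
    return np.allclose(B, B.T, atol=tol)

# An irreducible, aperiodic chain on 3 states (chosen for illustration).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

pi = stationary_distribution(P)
print("pi =", np.round(pi, 4))
print("pi P == pi:", np.allclose(pi @ P, pi))

# P and the lazy chain aI + (1 - a)P have the same stationary distribution.
a = 0.25
P_lazy = a * np.eye(3) + (1 - a) * P
print("same pi for lazy chain:", np.allclose(pi @ P_lazy, pi))

# Detailed balance check for this pi and P.
print("detailed balance:", satisfies_detailed_balance(pi, P))
```

For this particular matrix the detailed balance check comes out false even though $\pi$ is stationary, which illustrates that detailed balance is sufficient but not necessary.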
Computing average values

- Suppose now that we are interested not in a sample from the stationary distribution $\pi$ but in the average value of a function $f : X \to \mathbb{R}$ on the configurations of a stochastic Hopfield network with respect to $\pi$.
- For example, we may take $f(x) = \sum_{i=1}^N x_i$ and want to compute $E_\pi(\sum_{i=1}^N x_i)$, i.e., the average value of the sum of all binary node values with respect to the stationary distribution $\pi$ of the stochastic network.
- As we have seen, we can obtain a sample from $\pi$ by starting with any distribution $p_0$ and applying the transition matrix $P$ a large number of times, say $n$, so that $p_0 P^n$ approximates $\pi$.
- We thus reasonably expect to find a good estimate of $E_\pi(\sum_{i=1}^N x_i)$ by computing the expected value of $\sum_{i=1}^N x_i$ with respect to $p_0 P^n$, i.e., by finding $E_{p_0 P^n}(\sum_{i=1}^N x_i)$, as sketched below.
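As a minimal end-to-end sketch (Python with NumPy; the weight matrix W, temperature T and run lengths are illustrative choices of mine, not from the notes), we can run the probabilistic rule of Equation (3) on a tiny network, check that the visit frequencies approach the Boltzmann distribution (5), and estimate $E_\pi(\sum_i x_i)$ by averaging $f(x) = \sum_i x_i$ over the visited configurations, the usual time-average form of the estimate described above.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical small network: symmetric weights with zero diagonal, pseudo temperature T.
W = np.array([
    [0.0,  1.0, -0.5],
    [1.0,  0.0,  0.3],
    [-0.5, 0.3,  0.0],
])
T = 1.0
N = len(W)

def energy(x):
    # Hopfield energy E(x) = -1/2 * sum_ij w_ij x_i x_j
    return -0.5 * x @ W @ x

def update(x):
    # One asynchronous update using Equation (3): Pr(x_i = +1) = 1 / (1 + exp(-2 h_i / T)).
    i = rng.integers(N)
    h = W[i] @ x
    x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * h / T)) else -1
    return x

# Run the chain: discard a burn-in period, then record visited configurations
# and accumulate f(x) = sum_i x_i for the Monte Carlo average.
burn_in, n_samples = 10_000, 200_000
x = rng.choice([-1, 1], size=N)
counts, f_total = {}, 0.0
for step in range(burn_in + n_samples):
    x = update(x)
    if step >= burn_in:
        counts[tuple(x)] = counts.get(tuple(x), 0) + 1
        f_total += x.sum()

# Exact Boltzmann distribution (5) by enumerating all 2^N configurations (small N only).
states = [np.array(s) for s in product([-1, 1], repeat=N)]
weights = np.array([np.exp(-energy(s) / T) for s in states])
pi = weights / weights.sum()

# Compare empirical visit frequencies with pi(x) = exp(-E(x)/T) / Z ...
for s, p in zip(states, pi):
    emp = counts.get(tuple(s), 0) / n_samples
    print(tuple(s), f"empirical={emp:.3f}", f"pi={p:.3f}")

# ... and the MCMC estimate of E_pi(sum_i x_i) with the exact value.
exact = sum(p * s.sum() for p, s in zip(pi, states))
print(f"E_pi(sum_i x_i): mcmc={f_total / n_samples:.3f}  exact={exact:.3f}")
```

With these symmetric weights and no bias terms the energy satisfies $E(x) = E(-x)$, so the exact value of $E_\pi(\sum_i x_i)$ is 0; the run mainly checks that the estimate fluctuates around it and that the visit frequencies match the Boltzmann probabilities.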