Markov Chains and MCMC

Markov chains

I Let S = {1, 2,..., N} be a finite set consisting of N states.

I A Markov chain Y0, Y1, Y2, ... is a sequence of random variables, with Yt ∈ S for all points in time t ∈ N, that satisfies the Markov property, namely: given the present state, the future and past states are independent:

Pr(Yn+1 = x|Y0 = y0,..., Yn = yn) = Pr(Yn+1 = x|Yn = yn)

I A Markov chain is homogeneous if for all n ≥ 1:

Pr(Yn+1 = j|Yn = i) = Pr(Yn = j|Yn−1 = i)

i.e., the transition probabilities are time-independent.

I There is an N × N transition matrix P such that for all i, j, n we have:

$$P_{ij} := \Pr(Y_{n+1} = j \mid Y_n = i) \ge 0, \qquad \sum_{j=1}^{N} P_{ij} = 1.$$

I Any matrix $P \in \mathbb{R}^{N \times N}$ with $P_{ij} \ge 0$ and $\sum_{j=1}^{N} P_{ij} = 1$, for $1 \le i, j \le N$, is called a stochastic matrix.

Example

I Let S = {1, 2, 3, 4} and

$$P = \begin{pmatrix} 1/2 & 1/2 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \\ 1/3 & 1/6 & 1/6 & 1/3 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (1)$$

(checked numerically in the sketch below).

I Here is the graphical representation of P, i.e., only the edges with $P_{ij} > 0$ are drawn:

[Transition diagram of P: self-loops at states 1 and 2 (probability 1/2 each) and at state 4 (probability 1); edges 1 → 2 and 2 → 1 with probability 1/2; edges 3 → 1 (1/3), 3 → 2 (1/6), 3 → 3 (1/6), 3 → 4 (1/3).]
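As a quick check of the example (a minimal sketch assuming NumPy; it is not part of the lecture material), we can build P from Equation (1) and verify the stochastic-matrix property:

```python
import numpy as np

# Transition matrix of Equation (1)
P = np.array([
    [1/2, 1/2, 0,   0  ],
    [1/2, 1/2, 0,   0  ],
    [1/3, 1/6, 1/6, 1/3],
    [0,   0,   0,   1  ],
])

# Stochastic matrix: non-negative entries and every row sums to 1
assert (P >= 0).all()
assert np.allclose(P.sum(axis=1), 1.0)
print("P is a stochastic matrix")
```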

Markov chains as Dynamical Systems

I A homogeneous Markov chain defines a dynamical system:
I Let $X = \{x \in \mathbb{R}^N : x_i \ge 0, \ \sum_{i=1}^{N} x_i = 1\}$ be the space of probability vectors over S.
I $d(x, y) := \frac{1}{2} \sum_{k=1}^{N} |x_k - y_k| = \frac{1}{2}\|x - y\|_1$ is a metric on X.
I $P : X \to X$ defines a linear map by right multiplication.
I $P : x \mapsto xP$, i.e., $(xP)_n = \sum_{k=1}^{N} x_k P_{kn}$, is well-defined:

$$\sum_{n=1}^{N} (xP)_n = \sum_{n=1}^{N} \sum_{k=1}^{N} x_k P_{kn} = \sum_{k=1}^{N} \sum_{n=1}^{N} x_k P_{kn} = \sum_{k=1}^{N} x_k \sum_{n=1}^{N} P_{kn} = \sum_{k=1}^{N} x_k = 1.$$

I The n-step transition matrix is simply $P^{(n)} = P^n$.

I If x ∈ X then the probability vector over S evolves as

orbit of x : x, xP, xP^2, ..., xP^n, ...

I We are interested in the long-term behaviour of orbits (see the numerical sketch below).
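For illustration, here is a minimal sketch (assuming NumPy; the starting vector is an arbitrary choice) of the orbit of a probability vector under the example P of Equation (1), computed by repeated right multiplication:

```python
import numpy as np

# Transition matrix of Equation (1)
P = np.array([
    [1/2, 1/2, 0,   0  ],
    [1/2, 1/2, 0,   0  ],
    [1/3, 1/6, 1/6, 1/3],
    [0,   0,   0,   1  ],
])

x = np.array([0.0, 0.0, 1.0, 0.0])   # start in state 3 with probability 1
for n in range(6):
    print(f"x P^{n} =", np.round(x, 4))
    x = x @ P                        # one step of the orbit: x_{n+1} = x_n P
```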

Communicating Classes

I A state j is accessible from i, denoted by i → j, if there exists n ≥ 0 with $P^{(n)}_{ij} > 0$, i.e., j can be reached from i after a finite number of steps.

I Two states i and j communicate, denoted by i ↔ j, if i → j and j → i.

I Note that for any state i, we have i ↔ i. Why?

I Communication is an equivalence relation, which induces communicating classes.

I By analysing the transition matrix, we can determine the communicating classes.

I Find the communicating classes of P in Equation (1).

I A Markov chain is irreducible if it has a single communicating class, i.e., if any state can be reached from any state; the exercise above is worked through computationally in the sketch below.
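One way to compute the communicating classes of the matrix in Equation (1) programmatically (a sketch assuming NumPy; the boolean reachability construction is an illustrative choice, not the lecture's method):

```python
import numpy as np

# Transition matrix of Equation (1)
P = np.array([
    [1/2, 1/2, 0,   0  ],
    [1/2, 1/2, 0,   0  ],
    [1/3, 1/6, 1/6, 1/3],
    [0,   0,   0,   1  ],
])
N = P.shape[0]

# reach[i, j] is True iff j is accessible from i in n >= 0 steps
reach = np.eye(N, dtype=bool) | (P > 0)
for _ in range(N):  # iterate until reach is a transitive closure
    reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)

# i and j communicate iff each is accessible from the other
communicate = reach & reach.T
classes = {tuple(int(s) for s in np.flatnonzero(row) + 1) for row in communicate}
print(sorted(classes))   # expected: [(1, 2), (3,), (4,)]
```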

Aperiodic Markov Chains

I The period of a state i is defined as $d(i) := \gcd\{n \ge 1 : P^{(n)}_{ii} > 0\}$.
I Here, gcd denotes the greatest common divisor.

I Example: Consider a Markov chain with transition matrix,

 0 1 0 0   1/2 0 1/2 0  P =   (2)  0 1/2 0 1/2  0 0 1 0

All diagonal elements of $P^{2n}$ are positive and those of $P^{2n+1}$ are all zero. Thus, every state has period 2 (verified numerically in the sketch below).
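A small numerical check of this claim (a sketch assuming NumPy; examining powers up to a fixed bound suffices for this small example, although it is not a general-purpose algorithm for computing periods):

```python
import numpy as np
from math import gcd
from functools import reduce

# Transition matrix of Equation (2)
P = np.array([
    [0,   1,   0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   1/2, 0,   1/2],
    [0,   0,   1,   0  ],
])
N = P.shape[0]

max_n = 20                           # enough powers for this small example
Pn = np.eye(N)
returns = {i: [] for i in range(N)}  # values of n >= 1 with (P^n)_ii > 0
for n in range(1, max_n + 1):
    Pn = Pn @ P
    for i in range(N):
        if Pn[i, i] > 1e-12:
            returns[i].append(n)

periods = {i + 1: reduce(gcd, returns[i]) for i in range(N)}
print(periods)                       # expected: {1: 2, 2: 2, 3: 2, 4: 2}
```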

I If d(i) = 1 then i is said to be aperiodic.

I If all states are aperiodic then the chain is called aperiodic.

Irreducibility and Periodicity: basic properties

I If a Markov chain is irreducible then all its states have the same period.

I An irreducible Markov chain is aperiodic if there is a state i with $P_{ii} > 0$.
I An irreducible Markov chain is aperiodic iff there exists $n \ge 1$ such that $\forall i, j.\ P^{(n)}_{ij} > 0$.
I If P is the transition matrix of an irreducible Markov chain and 0 < a < 1, then aI + (1 − a)P is the transition matrix of an irreducible Markov chain, where I is the N × N identity matrix.

I Thus, by choosing a small a > 0, we can allow a small self-transition probability to make the Markov chain aperiodic.

I Moreover, we will see later that P and aI + (1 − a)P have the same stationary distribution.

Stationary distribution

I A probability vector π ∈ X is a stationary distribution of a Markov chain if πP = π.

I Any transition matrix has an eigenvalue equal to 1. Why?

I Thus, find the stationary distributions by solving the eigenvalue problem $\pi P = \pi$, or $P^T \pi^T = \pi^T$.
I Fundamental Theorem of Markov chains. An irreducible and aperiodic Markov chain has a unique stationary distribution π which satisfies:
I $\lim_{n \to \infty} xP^n = \pi$ for all x ∈ X.
I $\lim_{n \to \infty} P^n$ exists and is the matrix with all rows equal to π.

I This means that whatever our starting probability vector, the dynamics of the chain takes it to the unique stationary distribution or the steady-state of the system.

I Thus, we have an attractor π with basin X.

I Check: $\pi P = \pi \iff \pi(aI + (1 - a)P) = \pi$, for 0 < a < 1 (a numerical sketch follows).
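A numerical sketch (assuming NumPy, and using the chain of Equation (2) as the example) of finding π via the eigenvalue problem and checking that aI + (1 − a)P has the same stationary distribution:

```python
import numpy as np

# Transition matrix of Equation (2); this chain is irreducible
P = np.array([
    [0,   1,   0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   1/2, 0,   1/2],
    [0,   0,   1,   0  ],
])

# pi P = pi means pi^T is an eigenvector of P^T with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                    # normalise to a probability vector
print("pi =", np.round(pi, 4))        # expected: [1/6, 1/3, 1/3, 1/6]

a = 0.1
P_lazy = a * np.eye(4) + (1 - a) * P  # aperiodic modification of P
assert np.allclose(pi @ P_lazy, pi)   # same stationary distribution
```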

Reversible Chains and the Detailed Balance Condition

I A Markov chain is reversible if there is π ∈ X that satisfies the detailed balance condition:

$$\pi_i P_{ij} = \pi_j P_{ji}, \quad \text{for } 1 \le i, j \le N$$

I This means that $\Pr(Y_n = j, Y_{n-1} = i) = \Pr(Y_n = i, Y_{n-1} = j)$ when $\Pr(Y_{n-1} = i) = \pi_i$ for all i, i.e., the chain is time reversible.
I Exercise: If π satisfies the detailed balance condition, then it is a stationary distribution (a short verification follows below).
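For reference, the verification is a one-line computation: summing the detailed balance condition over i and using the fact that the rows of P sum to 1,

$$(\pi P)_j = \sum_{i=1}^{N} \pi_i P_{ij} = \sum_{i=1}^{N} \pi_j P_{ji} = \pi_j \sum_{i=1}^{N} P_{ji} = \pi_j.$$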

I This shows that satisfying the detailed balance condition is sufficient for π to be the stationary distribution.

I However, the detailed balance condition is not a necessary condition for π to be a stationary distribution.

Markov Chain Monte Carlo

I Markov Chain Monte Carlo (MCMC) methods are based on the convergence of the orbit of any initial probability vector to the unique stationary distribution of an irreducible and aperiodic Markov chain.
I If we need to sample from a distribution q, say on a finite state space, we construct an irreducible and aperiodic Markov chain P with unique stationary distribution π = q.
I Then, since $\lim_{n \to \infty} xP^n = \pi$, i.e. $\lim_{n \to \infty} |xP^n - \pi| = 0$ for any x ∈ X, it follows that for large n we have $xP^n \approx \pi = q$.

[Figure: the orbit x, xP, xP^2, xP^3, xP^4, ..., xP^n converging to the target distribution q.]

I Metropolis-Hastings algorithms, such as the Gibbs sampling used in stochastic Hopfield networks and RBMs, are examples of MCMC, as we will see; a generic sketch of the idea follows below.
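To make the idea concrete, here is a minimal Metropolis-style sampler on a finite state space (a sketch assuming NumPy; the target distribution q, the uniform proposal and the iteration count are illustrative choices, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.1, 0.2, 0.3, 0.4])   # target distribution (illustrative)
N = len(q)

x = 0                                # arbitrary initial state
counts = np.zeros(N)
n_steps = 100_000
for _ in range(n_steps):
    y = rng.integers(N)              # uniform, hence symmetric, proposal
    if rng.random() < min(1.0, q[y] / q[x]):
        x = y                        # accept the proposed move
    counts[x] += 1                   # record the current state either way

print("empirical:", np.round(counts / n_steps, 3))  # approximately q for large n
```

With a symmetric proposal, accepting with probability min(1, q(y)/q(x)) makes q satisfy the detailed balance condition for the resulting chain, so q is its stationary distribution.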

Stochastic Hopfield Networks

I Replace the deterministic updating $x_i \leftarrow \mathrm{sgn}(h_i)$, where $h_i = \sum_{j=1}^{N} w_{ij} x_j$, with an asynchronous probabilistic rule:

$$\Pr(x_i = \pm 1) = \frac{1}{1 + \exp(\mp 2 h_i / T)} = \frac{1}{1 + \exp(-2 h_i x_i / T)} \qquad (3)$$

where T > 0 is the pseudo temperature.
I As T → 0, this reduces to the deterministic rule.

[Figure: the logistic sigmoid map $1/(1 + e^{-x/T})$ for different temperatures; as T → 0 it approaches the step function sgn, and for T = 1 it is the standard sigmoid $1/(1 + e^{-x})$, passing through 1/2 at x = 0.]

I Exercise. From (3), obtain the probability of flipping:

$$\Pr(x_i \to -x_i) = \frac{1}{1 + \exp(\Delta E / T)}, \qquad (4)$$

where $\Delta E = E' - E$ is the change in energy.
I There is now some probability that the energy increases.
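One way to obtain (4) (assuming the usual Hopfield energy $E(x) = -\frac{1}{2}\sum_{j,k} w_{jk} x_j x_k$ with symmetric weights and zero diagonal, so that flipping $x_i$ changes the energy by $\Delta E = 2 h_i x_i$): by (3), the probability that node i takes the value $-x_i$ is

$$\Pr(x_i \to -x_i) = \frac{1}{1 + \exp(-2 h_i (-x_i)/T)} = \frac{1}{1 + \exp(2 h_i x_i / T)} = \frac{1}{1 + \exp(\Delta E / T)}.$$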

Stationary Distribution

I A stochastic Hopfield network can be viewed as a Markov process with $2^N$ states: $x_i = \pm 1$ for $1 \le i \le N$.
I The flipping probability (4) defines the transition matrix P.
I Irreducible: The flipping rule is applied asynchronously at all nodes, one at a time, and therefore there is a non-zero probability of going from any state to any state.
I Aperiodic: There is a non-zero probability that at any time a state does not change.
I Thus, the network has a unique stationary distribution.
I Exercise: Show that the distribution

$$\pi(x) := \Pr(x) = \frac{\exp(-E(x)/T)}{Z}, \qquad (5)$$

where Z is the normalisation factor, satisfies the detailed balance condition, i.e., it is the stationary distribution.
I Start with any configuration of the network; successively apply the flipping rule. After a large number of iterations, we have a sample from the stationary distribution (see the sketch below).
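A minimal sketch of this procedure (assuming NumPy; the random weights, temperature, burn-in and number of updates are illustrative choices). It also records a crude estimate of $E_\pi(\sum_i x_i)$, anticipating the discussion of average values below:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 8, 1.0
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                    # symmetric weights
np.fill_diagonal(W, 0.0)             # no self-connections

x = rng.choice([-1, 1], size=N)      # arbitrary initial configuration
sums = []
for step in range(5000):
    i = rng.integers(N)              # update one node at a time
    h = W[i] @ x                     # local field h_i
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * h / T))   # Pr(x_i = +1), Equation (3)
    x[i] = 1 if rng.random() < p_plus else -1
    if step > 1000:                  # crude burn-in, then record
        sums.append(x.sum())

# for large n the configurations are approximate samples from pi(x) = exp(-E(x)/T)/Z
print("estimate of E_pi(sum_i x_i):", float(np.mean(sums)))
```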

Computing average values

I Suppose now we are interested in finding not a sample from the stationary distribution π, but the average value of a function f : X → R on the configurations of a stochastic Hopfield network with respect to π.
I For example, we may take $f(x) = \sum_{i=1}^{N} x_i$ and want to compute $E_\pi(\sum_{i=1}^{N} x_i)$, i.e., the average value of the sum of all binary node values with respect to the stationary distribution π of the stochastic network.

I As we have seen, we can find a sample of π by starting with any distribution $p_0$ and applying the transition probability P a large number of times, n say, to obtain $p_0 P^n$ as a sample of π.

I We thus reasonably expect that we can find a good estimate for $E_\pi(\sum_{i=1}^{N} x_i)$ by computing the expected value of $\sum_{i=1}^{N} x_i$ with respect to $p_0 P^n$, i.e., find $E_{p_0 P^n}(\sum_{i=1}^{N} x_i)$.

Convergence of average value

I First recall that given a probability vector p ∈ X and a function f : {1,..., N} → R the average value or expected value of f with respect to p is given by

$$E_p(f) = \sum_{i=1}^{N} p_i f(i)$$

I If a sequence $p^{(n)} \in X$ (n ∈ N) of probability vectors converges to p, then we have:

$$E_{p^{(n)}}(f) \to E_p(f), \quad \text{as } n \to \infty.$$

I In fact, $\lim_{n \to \infty} p^{(n)} = p$ implies $\lim_{n \to \infty} \|p^{(n)} - p\|_1 = \lim_{n \to \infty} \sum_{i=1}^{N} |p^{(n)}_i - p_i| = 0$, i.e., $\lim_{n \to \infty} |p^{(n)}_i - p_i| = 0$ for each i = 1, ..., N. Thus:

$$E_{p^{(n)}}(f) = \sum_{i=1}^{N} p^{(n)}_i f(i) \to \sum_{i=1}^{N} p_i f(i) = E_p(f)$$

Simulated Annealing

I In optimisation problems, Simulated Annealing is performed in order to avoid getting stuck in local minima.

I In Hopfield networks, local minima correspond to spurious patterns with higher energy than the stored patterns.

I Start from a reasonably high value of T so that the states have a good probability of jumping from the basins of local minima to the basins of stored patterns.

I Then steadily lower T so that the states gradually follow a downhill path in the energy landscape to a stable state (see the sketch below).
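A minimal simulated-annealing sketch (assuming NumPy; the random weights, initial temperature, cooling factor and update counts are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                    # symmetric weights
np.fill_diagonal(W, 0.0)             # no self-connections

def energy(x):
    return -0.5 * x @ W @ x          # standard Hopfield energy

x = rng.choice([-1, 1], size=N)
T = 5.0                              # start reasonably hot
while T > 0.01:
    for _ in range(100):             # some single-node updates at this temperature
        i = rng.integers(N)
        h = W[i] @ x
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * h / T))
        x[i] = 1 if rng.random() < p_plus else -1
    T *= 0.9                         # steadily lower the temperature

print("final configuration:", x, " energy:", round(float(energy(x)), 3))
```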

[Figure: energy landscape with a deep minimum at a stored pattern and a shallower local minimum at a spurious pattern.]

Markov Random Fields I

I For applications in Computer Vision, Image Processing, etc., we need a generalisation of Markov chains to Markov Random Fields (MRF).

I In a Markov chain we have a sequence of random variables satisfying the Markov property that the future is independent of the past given the present.

I In an MRF, we have a vector of random variables represented in an undirected graph that describes the conditional dependence and independence of any pair of the random variables given a third one.

I Given a random variable Y, two random variables $Y_1$ and $Y_2$ are (conditionally) independent given Y if

$$\Pr(Y_1, Y_2 \mid Y) = \Pr(Y_1 \mid Y)\,\Pr(Y_2 \mid Y)$$

Markov Random Fields II

I Assume G = (V, E) is an undirected graph such that each node v ∈ V is associated with a random variable $Y_v$ taking values in a finite state space S. Thus, $S^{|V|}$ is the set of configurations.
I Assume the local Markov property: for all A ⊂ V, v ∈ V \ A:

$$\Pr(v \mid N_v, A) = \Pr(v \mid N_v),$$

where $N_v$ denotes the set of neighbours of v in G.

[Figure: a node v, its neighbourhood $N_v$, and a set A of other nodes in the graph G.]

I $(Y_v)_{v \in V}$ is called a Markov random field.

* Gibbs Distribution (Hammersley-Clifford Theorem)

I Recall that a clique of G is a fully connected subgraph.

I Let cl(G) denote the set of cliques of G and |A| the number of elements in any finite set A.
I Any strictly positive probability distribution $q : S^{|V|} \to [0, 1]$ on the configurations $S^{|V|}$ of an MRF factorises over cl(G).

I This means that for each C ∈ cl(G) there exists a function $\phi_C : S^{|C|} \to \mathbb{R}$ such that for $x \in S^{|V|}$

$$q(x) = \frac{1}{Z} \prod_{C \in cl(G)} \phi_C(x_C),$$

where $x_C \in S^{|C|}$ denotes the components of x in C and

$$Z = \sum_{x \in S^{|V|}} \prod_{C \in cl(G)} \phi_C(x_C),$$

is the normalisation constant, or the partition function.

* Logistic model

I Since $\phi_C(x_C) > 0$ for each clique C, we can define the energy of a clique for a configuration x as

$$E(x_C) := -\log \phi_C(x_C),$$

where

$$q(x) = \frac{1}{Z} \prod_{C \in cl(G)} \exp(-E(x_C)) = \frac{1}{Z} \exp\Big(-\sum_{C \in cl(G)} E(x_C)\Big), \qquad \text{with } Z = \sum_{x \in S^{|V|}} \exp\Big(-\sum_{C \in cl(G)} E(x_C)\Big).$$

I In a Logistic model, i.e., a log-linear model, the energy is assumed to have the form:

$$E(x_C) = -w_C^T f_C(x_C) = -w_C \cdot f_C(x_C),$$

where the vector $w_C$ represents a model parameter at clique C and $f_C$ is a clique-dependent vector function.
I Stochastic Hopfield networks, Boltzmann Machines and RBMs all have this Logistic model.

Gibbs Sampling

I The stochastic Hopfield network can be viewed as a Markov random field with a fully connected graph.
I The asynchronous probabilistic rule for flipping states one at a time is an example of a general method:
I Gibbs sampling in a Markov random field with a graph G = (V, E) updates each variable based on its conditional distribution given the state of the other variables.
I Assume $X = (X_1, \ldots, X_N)$, where V = {1, ..., N}, with $X_i$ taking values in a finite set.
I Suppose $\pi(x) = e^{-E(x)}/Z$ is the joint distribution of X.
I Assume q is a strictly positive distribution on V.
I At each step, pick i ∈ V with probability q(i); sample a new value for $X_i$ based on its conditional probability distribution given the state $(x_v)_{v \in V \setminus \{i\}}$ of all other variables $(X_v)_{v \in V \setminus \{i\}}$.
I This defines a transition matrix for which π satisfies the detailed balance condition, i.e., π is the stationary distribution.
I Instead of q, a pre-defined order is usually used for selecting nodes, and the result still holds.

Gibbs Sampling in Hopfield networks

I Suppose a probability distribution q is given on the N nodes of a Hopfield network with q(i) > 0 for 1 ≤ i ≤ N.
I If at each point of time we select node i with probability q(i) and then flip its value according to the flipping probability as in Equation (4), then we are performing Gibbs sampling.
I What is the transition matrix $P_{xy}$ for $x, y \in \{-1, 1\}^N$?
I If H(x, y) ≥ 2 (where H denotes the Hamming distance) then no transition takes place, i.e., $P_{xy} = 0$.
I If H(x, y) = 1, then $x_i \ne y_i$ for exactly one i with 1 ≤ i ≤ N. Then node i is selected with probability q(i) and thus $P_{xy} = q(i)/(1 + \exp \Delta E_i)$, where $\Delta E_i = E(-x_i) - E(x_i)$ is the change in the energy.

I And the probability that node i is selected and $x_i$ is not flipped is $q(i)\big(1 - 1/(1 + \exp \Delta E_i)\big)$.
I Note that all these probabilities sum to 1:

$$\sum_{i=1}^{N} \frac{q(i)}{1 + \exp \Delta E_i} + \sum_{i=1}^{N} q(i)\Big(1 - \frac{1}{1 + \exp \Delta E_i}\Big) = \sum_{i=1}^{N} q(i) = 1.$$