
Probabilistic Modelling

Georgy Gimel’farb

COMPSCI 369 Computational Science

Outline

1 Random walks

2 Markov Chains

3 Bayesian Networks

4 Markov Random Fields

Learning outcomes on probabilistic modelling:
• Be familiar with basic probabilistic modelling techniques and tools
• Be familiar with basic probability theory notions and Markov chains
• Understand maximum likelihood (ML) estimation and identify problems ML can solve
• Recognise and construct Markov models and hidden Markov models (HMMs)
• Recognise problems amenable to Monte Carlo algorithms and be able to identify which computational tools can be best used to solve them

Recommended reading:
• G. Strang, Computational Science and Engineering. Wellesley-Cambridge Press, 2007: Section 2.8
• C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006: Chapters 1, 2, 8, 11
• L. Wasserman, All of Statistics: A Concise Course in Statistical Inference. Springer, 2004: Chapter 17


What Is a Random Walk?

• 1D, 2D, 3D, or generally d-D trajectory consisting of successive random steps
• Fundamental model for a random process evolving in time
• Applications: …
• Random walk hypothesis – a financial theory stating that stock market prices evolve as a random walk and thus cannot be predicted from past movement


Random 1D Walk

1D grid: sites at …, −3∆, −2∆, −∆, 0, ∆, 2∆, 3∆, …

1D walk: one step of length ∆ per time unit, with P(+∆) = p and P(−∆) = 1 − p

• The drunkard’s walk: p = 0.5

• Distance Ln from the origin after n independent steps:
  • Expected (mean) distance: E[Ln] = n∆(2p − 1)
  • Variance: V[Ln] = 4n∆²p(1 − p)
  • Standard deviation: sn ≡ √V[Ln] = 2∆√(np(1 − p))
• If p = 0.5: E[Ln] = 0 and V[Ln] = n∆² (so sn = ∆√n)
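These moments are easy to check empirically. The sketch below (not part of the original slides) simulates many walks with NumPy and compares the sample mean and standard deviation of Ln against n∆(2p − 1) and 2∆√(np(1 − p)); the chosen n, p, and number of walks are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n, p, delta, walks = 1_000, 0.64, 1.0, 100_000

# The number of +delta steps is Binomial(n, p); L_n = delta * (n_right - n_left)
k = rng.binomial(n, p, size=walks)
L = delta * (2 * k - n)

print(L.mean(), n * delta * (2 * p - 1))              # ~280 vs 280
print(L.std(), 2 * delta * np.sqrt(n * p * (1 - p)))  # ~30.4 vs 30.36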


1D Random Walk - A Few Numerical Examples

Step length ∆ = 1; P(+1) = P(−1) = 0.5 (the drunkard's walk): E[Ln] = 0; sn = √n

Step n       1     10    100   1,000   10,000   100,000   1,000,000
Mean E[Ln]   0     0     0     0       0        0         0
St. d. sn    1     3.2   10    31.6    100      316.2     1,000

Step length ∆ = 1; P(+1) = 0.64; P(−1) = 0.36: E[Ln] = 0.28n; sn = 0.96√n

Step n       1     10    100   1,000   10,000   100,000   1,000,000
Mean E[Ln]   0.28  2.8   28    280     2,800    28,000    280,000
St. d. sn    0.96  3.0   9.6   30.4    96       303.5     960

Step length ∆ = 1; P(+1) = 0.9; P(−1) = 0.1: E[Ln] = 0.8n; sn = 0.6√n

Step n       1     10    100   1,000   10,000   100,000   1,000,000
Mean E[Ln]   0.8   8     80    800     8,000    80,000    800,000
St. d. sn    0.6   1.9   6     19      60       189.7     600


Pseudocode to Simulate a 1D Random Walk

Unit step: ∆ = ±1; P+1 + P−1 = 1; Pright ≡ P+1; Pleft ≡ P−1

Threshold: T = Pright

r = random()               // a computed pseudo-random number: 0 ≤ r ≤ 1
if r ≤ T then move right   // ∆ = +1
else move left             // ∆ = −1

Example (P+1 = 0.75):

n   r     ∆    Ln       n    r     ∆    Ln       n    r     ∆    Ln
1   0.84  −1   −1       6    0.20  +1   −2       11   0.48  +1   1
2   0.39  +1   0        7    0.34  +1   −1       12   0.63  +1   2
3   0.78  −1   −1       8    0.77  −1   −2       13   0.36  +1   3
4   0.80  −1   −2       9    0.28  +1   −1       14   0.51  +1   4
5   0.91  −1   −3       10   0.55  +1   0        15   0.95  −1   3
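For reference, a runnable Python version of the same threshold scheme (a sketch; the function name and defaults are mine, and its random sequence will differ from the r values tabulated above):

import random

def walk_1d(n_steps, p_right=0.75, seed=None):
    """Simulate a 1D unit-step random walk; return the positions L1, ..., Ln."""
    rng = random.Random(seed)
    position, path = 0, []
    for _ in range(n_steps):
        # move right (+1) if r <= T = P_right, otherwise left (-1)
        position += 1 if rng.random() <= p_right else -1
        path.append(position)
    return path

print(walk_1d(15, p_right=0.75, seed=1))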


Simulated 1D Random Walk (P+1 = 0.75)

(plot: a sample trajectory Ln and the expected distance E[Ln] = 0.5n over the first 20 steps)

n        10    20    30    40    50    60    70    80    90    100   ...
Ln       0     6     12    16    16    24    26    32    36    36    ...
E[Ln]    5     10    15    20    25    30    35    40    45    50    ...
sn       2.74  3.87  4.74  5.48  6.12  6.71  7.25  7.75  8.22  8.66  ...


2D and 3D Random Walks

(figures: a 2D walk in the (x, y) plane and a 3D walk in (x, y, z) space)


Example: 2D Random Walk

(figure: a simulated 2D random-walk trajectory)


Example: 3D Walks

(figure source: http://www.audienceoftwo.com/pics/upload/542px-Walk3d 0.png)


Pseudocode to Simulate a 2D Random Walk

Step probabilities: Pright ≡ P+1,0; Pup ≡ P0,+1; Pleft ≡ P−1,0; Pdown ≡ P0,−1; Pright + Pup + Pleft + Pdown = 1

Thresholds: T1 = Pright; T2 = T1 + Pup; T3 = T2 + Pleft

r = random()                   // a computed pseudo-random number: 0 ≤ r ≤ 1
if r ≤ T1 then move right      // ∆x = +1; ∆y = 0
else if r ≤ T2 then move up    // ∆x = 0;  ∆y = +1
else if r ≤ T3 then move left  // ∆x = −1; ∆y = 0
else move down                 // ∆x = 0;  ∆y = −1
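The cumulative-threshold idea extends to any number of directions. A minimal Python sketch (the move list and its probabilities are illustrative choices, here 0.25 each):

import random

# (step, probability) pairs for right, up, left, down; probabilities sum to 1
MOVES = [((1, 0), 0.25), ((0, 1), 0.25), ((-1, 0), 0.25), ((0, -1), 0.25)]

def step_2d(rng):
    """Choose one move by comparing r with the cumulative thresholds T1, T2, T3, 1."""
    r, threshold = rng.random(), 0.0
    for (dx, dy), prob in MOVES:
        threshold += prob
        if r <= threshold:
            return dx, dy
    return MOVES[-1][0]        # guard against floating-point round-off

rng = random.Random(42)
x = y = 0
for _ in range(100):
    dx, dy = step_2d(rng)
    x, y = x + dx, y + dy
print((x, y))                  # final position after 100 steps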


Some Properties of Random Walks

• Gambler’s ruin, or recurrence phenomenon: a simple 1D random walk (P−1 = P+1 = 0.5) crosses every point an infinite number of times
• A gambler with a finite amount of money playing a fair game against a bank with infinite funds will surely lose!
• Probability Pr(d) that a random walk on a d-D hypercubic lattice ever returns to the origin:
  • Pr(1) = 1; Pr(2) = 1 (recurrent walks: d ≤ 2)
  • Pr(3) = 0.3405...; Pr(4) = 0.1932... (transient walks: d ≥ 3)
• The drunkard eventually gets back to his house from the bar if his random walk is on the set of all points of the line or plane with integer coordinates
• But in three dimensions the probability of ever returning is only roughly 34% (see the simulation sketch below)
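These recurrence numbers can be approximated by simulation. The sketch below (not from the slides) counts walks that revisit the origin within a finite horizon, which only lower-bounds the true return probability Pr(d); all parameters are arbitrary:

import numpy as np

def return_fraction(d, n_steps=1_000, n_walks=1_000, seed=0):
    """Fraction of d-D lattice walks returning to the origin within n_steps."""
    rng = np.random.default_rng(seed)
    returned = 0
    for _ in range(n_walks):
        axes = rng.integers(0, d, size=n_steps)      # axis moved at each step
        signs = rng.choice([-1, 1], size=n_steps)    # direction along that axis
        pos = np.zeros(d, dtype=int)
        for a, s in zip(axes, signs):
            pos[a] += s
            if not pos.any():                        # back at the origin
                returned += 1
                break
    return returned / n_walks

print(return_fraction(3))   # close to (but below) Pr(3) = 0.3405...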


Markov Chains

x1 → … → xn−1 → xn → xn+1 → … → xN

• 1st-order Markov chain: a sequence of random variables x1, ..., xN with the property that, for n = 1, ..., N − 1,

P(xn+1 | x1, ..., xn) = P(xn+1 | xn)

• Homogeneous Markov chain: the same transition probabilities for all n
• Transition matrix over K states:

P = [pαβ; α, β = 1, ..., K] ≡ [P(xn+1 = β | xn = α); α, β = 1, ..., K]


Markov Chains

x1 → … → xn−1 → xn → xn+1 → … → xN

Invariant distribution P∗ for a homogeneous chain:

P∗(xn+1) = Σxn P(xn+1 | xn) P∗(xn)

• A given Markov chain may have more than one invariant distribution

Detailed balance – a sufficient (but not necessary) condition of invariance:

P∗(xn+1) P(xn | xn+1) = P∗(xn) P(xn+1 | xn)

Reversible Markov chain: one for which detailed balance holds


Markov Chain: An Example

1D random walk with reflecting barriers on the states {1, 2, 3, 4, 5}: at each step n, the chain variable x[n] takes one of these values

Transition matrix P ≡ [P(x[n+1] = β | x[n] = α); α, β = 1, ..., 5] (column α = current state, row β = next state):

    | 0    1−p   0     0    0 |
    | 1    0     1−p   0    0 |
P = | 0    p     0     1−p  0 |
    | 0    0     p     0    1 |
    | 0    0     0     p    0 |

Invariant p.d.:

P∗(x) = 1 / (1 + (1 − 2p)²) · [(1 − p)³, (1 − p)², p(1 − p), p², p³]ᵀ
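This invariant distribution is easy to verify numerically. A sketch (p = 0.3 is an arbitrary choice); note the code stores the transition matrix row-stochastically, i.e. as the transpose of the array displayed above:

import numpy as np

p = 0.3
# Row-stochastic convention: P[a, b] = P(x[n+1] = b+1 | x[n] = a+1)
P = np.array([[0,     1,     0,     0,   0],
              [1 - p, 0,     p,     0,   0],
              [0,     1 - p, 0,     p,   0],
              [0,     0,     1 - p, 0,   p],
              [0,     0,     0,     1,   0]])

pi = np.array([(1 - p)**3, (1 - p)**2, p * (1 - p), p**2, p**3])
pi /= 1 + (1 - 2 * p)**2          # the normalising constant

print(pi.sum())                   # 1.0
print(np.allclose(pi @ P, pi))    # True: pi is invariant under P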


Markov Chains

• Ergodicity: irrespective of P(x1), the distribution P(xn) converges as n → ∞ to the required invariant distribution P∗(x)
• A homogeneous Markov chain is ergodic under weak restrictions on the invariant distribution and the transition probabilities
• The invariant distribution of an ergodic chain is called the equilibrium distribution
• An ergodic Markov chain has only one equilibrium distribution
• Higher-order Markov chains: P(xn+1 | x1, ..., xn) = P(xn+1 | xn, ..., xn−k)
• Generally, the dependence need not be on only the nearest k variables


First-order Markov Chains

• m-step transition probability of going from state α to state β in m steps: pαβ(m) = P (xn+m = β|xn = α)

• Chapman–Kolmogorov equations: pαβ(m + n) = Σγ pαγ(m) pγβ(n), since

Σγ P(xk+m = γ | xk = α) P(xk+m+n = β | xk+m = γ) = Σγ P(xk+m+n = β, xk+m = γ | xk = α) ≡ P(xk+m+n = β | xk = α)

• In matrix form: P(1) = P by definition; P(n) = Pⁿ; P(m + n) = P(m) P(n), as the check below illustrates
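A quick numerical check of the matrix form (the 2-state transition matrix below is hypothetical):

import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

m, n = 3, 5
# Chapman-Kolmogorov in matrix form: P(m + n) = P(m) P(n)
print(np.allclose(matrix_power(P, m + n),
                  matrix_power(P, m) @ matrix_power(P, n)))   # True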


Simulation of a Homogeneous Markov Chain

• Initial data: the marginal p.d. P0(x) and transition matrix P

• Sample x0 = a from the initial marginal distribution P0(x), i.e. take one of the values at random according to the marginals P0(x)

Example: x ∈ {0, 1, 2}; P0(0) = 0.2; P0(1) = 0.3; P0(2) = 0.5

r = random()    // a pseudo-random number in [0, 1]
if r ≤ 0.2 then x = 0
else if r ≤ 0.5 then x = 1
else x = 2

• For n = 0,...,N − 1, sample the next xn+1 = b for the current xn = a from the conditional p.d. Pn+1(x) = P(xn+1 = x|xn = a)

Example: xn = 2; Pn+1(0) = 0.1; Pn+1(1) = 0.2; Pn+1(2) = 0.7

r = random()    // a pseudo-random number in [0, 1]
if r ≤ 0.1 then x = 0
else if r ≤ 0.3 then x = 1
else x = 2
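Putting both sampling steps together, a Python sketch of the whole simulation; P0 and the row for xn = 2 are taken from the examples above, while the other two rows of P are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
P0 = np.array([0.2, 0.3, 0.5])        # initial marginals (from the slide)
P = np.array([[0.6, 0.3, 0.1],        # hypothetical row P(. | x_n = 0)
              [0.2, 0.5, 0.3],        # hypothetical row P(. | x_n = 1)
              [0.1, 0.2, 0.7]])       # row P(. | x_n = 2) from the slide

def simulate_chain(N):
    """Sample x_0 from P0, then x_{n+1} from row x_n of P, for n = 0..N-1."""
    x = rng.choice(3, p=P0)
    chain = [x]
    for _ in range(N):
        x = rng.choice(3, p=P[x])
        chain.append(x)
    return chain

print(simulate_chain(10))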


Conditional Independence

Directed graph (digraph) G = {V, E} of causal relationships:
• Sets of nodes V and edges E (ordered pairs of nodes)
• Directed path between X and Y: a sequence of arrows all pointing in the same direction
• Undirected path between X and Y: a sequence of adjacent nodes (ignoring the direction of arrows)
• X is an ancestor of Y and Y is a descendant of X if there is a directed path from X to Y (or X = Y)
• Cycle: a directed path starting and ending at the same node
• DAG (directed acyclic graph) – a digraph with no cycles
• Conditional independence of X and Y given Z: fX,Y|Z(x, y | z) = fX|Z(x | z) fY|Z(y | z) for all x, y, and z
• Equivalently: f(x | y, z) = f(x | z), that is, given Z, Y provides no extra information about X


Probability and DAGs

Bayesian network – a digraph endowed with a probability distribution
• Each node corresponds to a random variable
• Edges represent statistical dependencies
• The term is a bit misleading, as inference in such networks is not only Bayesian!
• (X, Y) ∈ E means an arrow points from X to Y: X and Y are called adjacent; X is a parent of Y; and Y is a child of X

Given a DAG G with vertices V = (X1, ..., Xk), a distribution P for V with probability function f is Markov to the graph G if

f(v) ≡ f(x1, ..., xk) = ∏i=1..k f(xi | πi)

where πi are the parents of Xi
• Equivalent notion: G represents P


Probability and DAGs

Markov Condition A distribution P is represented by a DAG G if and only if every variable W is conditionally independent of all the other variables except the parents and descendants of W , given the parents of W

Markov equivalent graphs: imply the same independence relations

• Estimate a distribution f consistent with a DAG G, given G and data sets V1, ..., Vn from f
• Likelihood: L(θ) = ∏i=1..n f(Vi; θ) = ∏i=1..n ∏j=1..m f(Xij | πj; θj)
  • Xij – the value of the node Xj in the i-th data set Vi
  • θj – the parameters of the j-th conditional density
• Estimate G, given data V1, ..., Vn (a very challenging problem!)
  • Fit every possible DAG using MLE and use the Akaike information criterion or other methods to choose a DAG


Bayesian Network: An Example

(figure: a DAG on x1, ..., x7 with edges x1→x4, x2→x4, x3→x4, x1→x5, x3→x5, x4→x6, x4→x7, x5→x7)

f(x1, ..., x7) = ∏j=1..7 f(xj | πj) = fa(x1) fb(x2) fc(x3) fd(x4 | x1, x2, x3) fe(x5 | x1, x3) fg(x6 | x4) fh(x7 | x4, x5)

• The name means only that statistical inference for digraphs can also be performed by Bayesian methods
• This model is also called a belief network
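To make the factorisation concrete, the sketch below evaluates f(x1, ..., x7) for binary variables; every conditional probability in it is made up purely for illustration:

from itertools import product

def bern(q):
    """Return f(x) for a binary variable with P(x = 1) = q."""
    return lambda x: q if x == 1 else 1 - q

fa = fb = fc = bern(0.5)                     # the root nodes x1, x2, x3

def fd(x4, x1, x2, x3):                      # hypothetical P(x4 | x1, x2, x3)
    return bern(0.2 + 0.2 * (x1 + x2 + x3))(x4)

def fe(x5, x1, x3):                          # hypothetical P(x5 | x1, x3)
    return bern(0.1 + 0.4 * (x1 + x3))(x5)

def fg(x6, x4):                              # hypothetical P(x6 | x4)
    return 0.9 if x6 == x4 else 0.1

def fh(x7, x4, x5):                          # hypothetical P(x7 | x4, x5)
    return bern(0.3 + 0.3 * (x4 + x5))(x7)

def joint(x1, x2, x3, x4, x5, x6, x7):
    """f = fa fb fc fd fe fg fh, exactly as in the factorisation above."""
    return (fa(x1) * fb(x2) * fc(x3) * fd(x4, x1, x2, x3)
            * fe(x5, x1, x3) * fg(x6, x4) * fh(x7, x4, x5))

# The joint sums to 1 over all 2^7 assignments because each factor is a
# proper conditional distribution
print(sum(joint(*v) for v in product((0, 1), repeat=7)))   # 1.0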


Bayesian Network: d-Separation

d-separation (directed separation) of node sets A and B by a set S of nodes: the variables in A and B are independent given the variables in S, e.g.

A = {x1, x2, x3}; B = {x6, x7}; S = {x4, x5}

(figures: two copies of the example DAG with A, B, and S marked)


Bayesian Network: Markov Blanket

Markov blanket of a node j:
• The set of its parents, children, and co-parents (the other parents of its children)
• It d-separates j from all other nodes, e.g.

MB1 = {x2, x3, x4, x5}; MB2 = {x1, x3, x4}; MB5 = {x1, x3, x4, x7}

(figures: three copies of the example DAG with the Markov blankets of x1, x2, and x5 marked)


Undirected Graphs

Undirected graph G = (V, E): a finite set V of nodes (vertices) and a finite set E of edges (arcs) consisting of pairs of vertices
• X and Y are adjacent (neighbours) if (X, Y) ∈ E or (Y, X) ∈ E
• X0, ..., Xn is a path if Xi and Xi+1 are adjacent for each i
• Complete graph: there is an edge between every pair of nodes
• Subgraph: a subset U ⊂ V of nodes together with their edges
• If A, B, and C are three distinct subsets of V, then C separates A and B if every path from a node in A to a node in B intersects a node in C


Markov Graphs

Pairwise Markov graph:
• V – a set of random variables with distribution P
• One node for each random variable in V
• No edge between a pair of variables that are independent given the rest of the variables

Theorem (Global Markov condition) Let G = (V,E) be a pairwise Markov graph for a distribution P . Let A, B, and C be distinct subsets of V such that C separates A and B. Then A and B are conditionally independent given C

• Unconnected A and B (i.e. with no path between them, so they are separated by the empty set) are independent
• The pairwise and global Markov properties are equivalent


Markov Random Fields

MRF – also called a Markov network or an undirected graphical model
• Each node corresponds to a random variable or a group of variables
• Example (the 2D Ising model below): edges link each node to its nearest 4-neighbourhood

(figure source: http://patspam.com/applets/physics/ising%20model/fig1.jpg)

• Clique – a subset of nodes all adjacent to each other
• Maximal clique – a non-extendable clique
• Potential – any positive function


Markov Random Fields

Theorem (Hammersley–Clifford)

The joint distribution of the MRF variables factorises over the maximal cliques C of the graph:

P(x1, ..., xk) = (1/Z) ∏C ψC(xC), where xC ≡ {xi : i ∈ C}

• Z – a normalisation constant (the partition function): Z = Σx∈X ∏C ψC(xC)
• ψC(xC) – a strictly positive potential function: ψC(xC) = exp(−E(xC))
• E(xC) – an energy function (total energy: E(x) = ΣC E(xC))


Markov Random Fields

• Clique – a complete subgraph (each node pair is linked by an edge)
  • Clique of order 1 – a node itself
  • Clique of order 2 – each linked pair of nodes
  • Clique of order 3 – each linked triple of nodes
  • Clique of order 4 – each linked quadruple of nodes
• 2D Ising MRF (a brute-force sketch follows below):
  • Nodes – points of an arithmetic lattice R = {(i, j) : i = 0, ..., I − 1; j = 0, ..., J − 1}
  • Signals xi,j with only two values, {−1, 1}
  • Cliques of order 2 linking the nearest 4-neighbours: ((i, j), (i + 1, j)) and ((i, j), (i, j + 1))
  • Energy function E(x) = −γ Σi,j xi,j (xi+1,j + xi,j+1)
  • The coefficient γ is known or learned from a given training sample
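For a lattice small enough to enumerate, the Gibbs distribution P(x) = (1/Z) exp(−E(x)) can be computed by brute force. A sketch for a hypothetical 3 × 3 lattice with γ = 0.5:

import itertools
import numpy as np

I, J, gamma = 3, 3, 0.5

def energy(x):
    """E(x) = -gamma * sum over pairwise cliques of x[i,j]*(x[i+1,j] + x[i,j+1])."""
    e = 0.0
    for i in range(I):
        for j in range(J):
            if i + 1 < I:
                e -= gamma * x[i, j] * x[i + 1, j]
            if j + 1 < J:
                e -= gamma * x[i, j] * x[i, j + 1]
    return e

# Brute-force partition function Z over all 2^(I*J) = 512 samples
configs = [np.array(c).reshape(I, J)
           for c in itertools.product((-1, 1), repeat=I * J)]
Z = sum(np.exp(-energy(x)) for x in configs)

x_all_plus = np.ones((I, J), dtype=int)
print(np.exp(-energy(x_all_plus)) / Z)   # probability of the all-+1 sample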


2D Ising MRF – Pixel-wise Transitions

(figure: the 4-neighbourhood of pixel (i, j): (i − 1, j), (i + 1, j), (i, j − 1), (i, j + 1))

Probability of a sample x = (xi,j : (i, j) ∈ R): P(x) = (1/Z) exp(−E(x))

Pixel-wise transition probability, with x[i,j] denoting the sample x without the point xi,j:

pi,j(xi,j) = P(x) / P(x[i,j]) ≡ P(x) / Σxi,j∈{−1,1} P(x) = exp(γ xi,j Si,j) / (exp(−γ Si,j) + exp(γ Si,j))

where Si,j = xi−1,j + xi+1,j + xi,j−1 + xi,j+1

Simulating Samples of a 2D Ising MRF

Two conditionally independent sub-lattices:

R0 = {(i, j) : (i, j) ∈ R; (i + j) mod 2 = 0}
R1 = {(i, j) : (i, j) ∈ R; (i + j) mod 2 = 1}


Markov Chain Monte Carlo Simulation

Pseudocode for one sweep (S = γ Si,j here; the sign is chosen to match the worked examples below):

for a ∈ {0, 1}
  for (i, j) ∈ Ra
    S = γ (xi−1,j + xi+1,j + xi,j−1 + xi,j+1)   // xα,β = 0 if (α, β) ∉ R
    p−1 = exp(−S) / (exp(−S) + exp(S));  p1 = 1 − p−1
    if random() < p−1 then xi,j = −1 else xi,j = 1
  end for
end for

Example: all four neighbours of xi,j equal −1, so Si,j = −4 and S = −4γ

For γ = 0.1: p−1 = e^0.4 / (e^0.4 + e^−0.4) = 0.69; p1 = 1 − 0.69 = 0.31
For γ = 1.0: p−1 = e^4.0 / (e^4.0 + e^−4.0) = 0.9997; p1 = 1 − 0.9997 = 0.0003


Markov Chain Monte Carlo Simulation

Example: three neighbours of xi,j equal −1 and one equals +1, so Si,j = −2 and S = −2γ

For γ = 0.1: p−1 = e^0.2 / (e^0.2 + e^−0.2) = 0.60; p1 = 1 − 0.60 = 0.40
For γ = 1.0: p−1 = e^2.0 / (e^2.0 + e^−2.0) = 0.982; p1 = 1 − 0.982 = 0.018

Example: two neighbours equal −1 and two equal +1, so Si,j = 0 and S = 0

For any γ: p−1 = e^0 / (e^0 + e^0) = 0.50; p1 = 1 − 0.50 = 0.50

Generated samples {xt; t = 0, 1, ..., T} are distributed according to the 2D Ising MRF P(x) as t → ∞; a runnable sketch follows below
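A runnable Python version of the checkerboard sweep above (a sketch: the handling of lattice borders with x = 0 outside R and the sign convention S = γ Si,j follow the pseudocode, while the lattice size, γ, number of sweeps, and seed are arbitrary choices):

import numpy as np

def ising_gibbs(I=64, J=64, gamma=0.5, sweeps=100, seed=0):
    """Checkerboard Gibbs sampler for the 2D Ising MRF."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(I, J))
    for _ in range(sweeps):
        for a in (0, 1):                        # sub-lattices R0 and R1
            for i in range(I):
                for j in range(J):
                    if (i + j) % 2 != a:
                        continue
                    # Sum of the 4 neighbours, taking x = 0 outside the lattice
                    s = ((x[i - 1, j] if i > 0 else 0)
                         + (x[i + 1, j] if i + 1 < I else 0)
                         + (x[i, j - 1] if j > 0 else 0)
                         + (x[i, j + 1] if j + 1 < J else 0))
                    S = gamma * s
                    p_minus = np.exp(-S) / (np.exp(-S) + np.exp(S))
                    x[i, j] = -1 if rng.random() < p_minus else 1
    return x

sample = ising_gibbs()
print(abs(sample.mean()))   # |magnetisation|: near 0 for small gamma, near 1 for large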

(animation: http://www.physics.cornell.edu/sethna/teaching/StatPhys/GIF/ScalingIsing.gif)