Outline Random walks Markov Chains Bayesian Networks Markov Random Fields
Probabilistic Modelling
Georgy Gimel’farb
COMPSCI 369 Computational Science
Outline
1 Random walks
2 Markov Chains
3 Bayesian Networks
4 Markov Random Fields
Learning outcomes on probabilistic modelling: be familiar with basic probabilistic modelling techniques and tools
• Be familiar with basic probability theory notions and Markov chains
• Understand the maximum likelihood (ML) method and identify problems ML can solve
• Recognise and construct Markov models and hidden Markov models (HMMs)
• Recognise problems amenable to Monte Carlo algorithms and be able to identify which computational tools can best be used to solve them
Recommended reading:
• G. Strang, Computational Science and Engineering. Wellesley-Cambridge Press, 2007: Section 2.8
• C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006: Chapters 1, 2, 8, 11
• L. Wasserman, All of Statistics: A Concise Course of Statistical Inference. Springer, 2004: Chapter 17
What Is a Random Walk?
• 1D, 2D, 3D, or generally d-D trajectory consisting of successive random steps
• Fundamental model for a random process evolving in time
• Applications: computer science, physics, ecology, economics, ...
• Random walk hypothesis – a financial theory stating that stock market prices evolve as a random walk and thus cannot be predicted from past movements
Random 1D Walk
[Figure: 1D grid with positions ..., −3∆, −2∆, −∆, 0, ∆, 2∆, 3∆, ...]

1D walk step probabilities: P(+∆) = p; P(−∆) = 1 − p
• The drunkard’s walk: p = 0.5
• Distance Ln from the origin after n independent steps:
  • Expected (mean) distance: E[Ln] = n∆(2p − 1)
  • Distance variance: V[Ln] = 4n∆²p(1 − p)
  • Standard deviation: sn ≡ √V[Ln] = 2∆√(np(1 − p))
• If p = 0.5: E[Ln] = 0; V[Ln] = n∆² (so sn = ∆√n)
1D Random Walk - A Few Numerical Examples
Step length ∆ = 1; P(+1) = P(−1) = 0.5 (the drunkard's walk): E[Ln] = 0; sn = √n

  Step n       1     10    100   1,000  10,000  100,000  1,000,000
  Mean E[Ln]   0      0      0       0       0        0          0
  St. d. sn    1    3.2     10    31.6     100    316.2      1,000

Step length ∆ = 1; P(+1) = 0.64; P(−1) = 0.36: E[Ln] = 0.28n; sn = 0.96√n

  Step n       1     10    100   1,000  10,000  100,000  1,000,000
  Mean E[Ln]  0.28   2.8    28     280   2,800   28,000    280,000
  St. d. sn   0.96   3.0   9.6    30.4      96    303.5        960

Step length ∆ = 1; P(+1) = 0.9; P(−1) = 0.1: E[Ln] = 0.8n; sn = 0.6√n

  Step n       1     10    100   1,000  10,000  100,000  1,000,000
  Mean E[Ln]  0.8     8     80     800   8,000   80,000    800,000
  St. d. sn   0.6   1.9      6      19      60    189.7        600
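These closed-form values are easy to check by simulation; the sketch below (function and parameter names are mine, not from the slides) estimates E[Ln] and sn over many independent walks:

```python
import random

def walk_stats(p, n_steps, n_trials, seed=0):
    """Monte Carlo estimates of E[L_n] and s_n for a 1D random walk
    with unit steps: +1 with probability p, -1 otherwise."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_trials):
        pos = 0
        for _ in range(n_steps):
            pos += 1 if rng.random() < p else -1
        finals.append(pos)
    mean = sum(finals) / n_trials
    var = sum((x - mean) ** 2 for x in finals) / n_trials
    return mean, var ** 0.5

# Theory: E[L_n] = n(2p - 1), s_n = 2*sqrt(n*p*(1-p)).
mean, sd = walk_stats(p=0.64, n_steps=100, n_trials=20000)
print(mean, sd)  # close to the theoretical 28 and 9.6
```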
Pseudocode to Simulate a 1D Random Walk
Unit step: ∆ = ±1; P+1 + P−1 = 1; Pright ≡ P+1; Pleft ≡ P−1
Threshold: T = Pright

r = random()              // a computed pseudo-random number: 0 ≤ r ≤ 1
if r ≤ T then move right  // ∆ = +1
else move left            // ∆ = −1
Example (P+1 = 0.75):

   n    r     ∆    Ln
   1   0.84  −1   −1
   2   0.39  +1    0
   3   0.78  −1   −1
   4   0.80  −1   −2
   5   0.91  −1   −3
   6   0.20  +1   −2
   7   0.34  +1   −1
   8   0.77  −1   −2
   9   0.28  +1   −1
  10   0.55  +1    0
  11   0.48  +1    1
  12   0.63  +1    2
  13   0.36  +1    3
  14   0.51  +1    4
  15   0.95  −1    3
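The threshold method above translates directly into Python; a minimal sketch (names are mine):

```python
import random

def simulate_walk(p_right, n_steps, seed=None):
    """Simulate a 1D random walk with the threshold method from the
    pseudocode: step +1 if r <= p_right, else -1. Returns L_1, ..., L_n."""
    rng = random.Random(seed)
    positions = []
    pos = 0
    for _ in range(n_steps):
        r = rng.random()
        pos += 1 if r <= p_right else -1
        positions.append(pos)
    return positions

print(simulate_walk(0.75, 15, seed=1))
```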
Simulated 1D Random Walk (P+1 = 0.75)
[Figure: simulated trajectory Ln plotted against the expected distance E[Ln] over the first 20 steps]

  n      10    20    30    40    50    60    70    80    90   100  ...
  Ln      0     6    12    16    16    24    26    32    36    36  ...
  E[Ln]   5    10    15    20    25    30    35    40    45    50  ...
  sn    2.74  3.87  4.74  5.48  6.12  6.71  7.25  7.75  8.22  8.66 ...
2D and 3D Random Walks
[Figures: a 2D walk in the (x, y) plane and a 3D walk in (x, y, z) space]
Example: 2D Random Walk
Example: 3D Walks
http://www.audienceoftwo.com/pics/upload/542px-Walk3d 0.png
Pseudocode to Simulate a 2D Random Walk
Step probabilities: Pright ≡ P+1,0; Pup ≡ P0,+1; Pleft ≡ P−1,0; Pdown ≡ P0,−1;
Pright + Pup + Pleft + Pdown = 1
Thresholds: T1 = Pright; T2 = T1 + Pup; T3 = T2 + Pleft

r = random()                   // a computed pseudo-random number: 0 ≤ r ≤ 1
if r ≤ T1 then move right      // ∆x = +1; ∆y = 0
else if r ≤ T2 then move up    // ∆x = 0; ∆y = +1
else if r ≤ T3 then move left  // ∆x = −1; ∆y = 0
else move down                 // ∆x = 0; ∆y = −1
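A runnable version of the cumulative-threshold scheme; the step probabilities in the call at the end are assumed values for illustration:

```python
import random

# Unit moves for each direction.
MOVES = {"right": (1, 0), "up": (0, 1), "left": (-1, 0), "down": (0, -1)}

def simulate_walk_2d(p_right, p_up, p_left, n_steps, seed=None):
    """2D random walk via cumulative thresholds: T1 = p_right,
    T2 = T1 + p_up, T3 = T2 + p_left; the remaining mass is 'down'."""
    rng = random.Random(seed)
    t1, t2, t3 = p_right, p_right + p_up, p_right + p_up + p_left
    x = y = 0
    path = [(0, 0)]
    for _ in range(n_steps):
        r = rng.random()
        if r <= t1:
            dx, dy = MOVES["right"]
        elif r <= t2:
            dx, dy = MOVES["up"]
        elif r <= t3:
            dx, dy = MOVES["left"]
        else:
            dx, dy = MOVES["down"]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

path = simulate_walk_2d(0.25, 0.25, 0.25, n_steps=1000, seed=42)
```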
Some Properties of Random Walks
• Gambler’s ruin, or recurrence phenomenon: a simple 1D random walk (P−1 = P+1 = 0.5) crosses every point an infinite number of times
• A gambler with a finite amount of money playing a fair game against a bank with infinite funds will surely lose!
• Probability Pr(d) that a random walk on a d-D hypercubic lattice returns to the origin:
  • Pr(1) = 1; Pr(2) = 1 — recurrent walks: d ≤ 2
  • Pr(3) = 0.3405...; Pr(4) = 0.1932... — transient walks: d ≥ 3
• The drunkard eventually gets back to his house from the bar if his random walk is on the set of all points with integer coordinates in the line or plane
• But in three dimensions, the probability of ever returning is only roughly 34%
Markov Chains
x1 ... xn−1 xn xn+1 ... xN
• 1st-order Markov chain: a series of random variables x1,..., xN with the conditional independence property: for n = 1,...,N − 1
P (xn+1|x1,..., xn) = P (xn+1|xn)
• Homogeneous Markov chain: the same transition probabilities for all n
• Transition matrix:

  P = [pαβ]  with  pαβ ≡ P(xn+1 = β | xn = α),  α, β = 1, ..., K
Markov Chains
x1 ... xn−1 xn xn+1 ... xN
Invariant marginal distribution for a homogeneous chain:

  P*(xn+1) = Σ_{xn} P(xn+1 | xn) P*(xn)

• A given Markov chain may have more than one invariant distribution
• Detailed balance – a sufficient (but not necessary) condition of invariance:

  P*(xn+1) P(xn | xn+1) = P*(xn) P(xn+1 | xn)

• Reversible Markov chain: one for which detailed balance holds
Markov Chain: An Example

1D random walk with reflecting barriers: at each step n, the chain variable x[n] takes values from the state set {1, 2, 3, 4, 5}
Transition matrix P ≡ [P(x[n+1] = β | x[n] = α)], α, β = 1, ..., 5, written with columns indexed by the current state α (so each column sums to 1):

      | 0    1−p   0    0    0 |
      | 1    0    1−p   0    0 |
  P = | 0    p    0    1−p   0 |
      | 0    0    p    0     1 |
      | 0    0    0    p     0 |

Invariant p.d.:

  P*(x) = 1 / (1 + (1 − 2p)²) · ((1−p)³, (1−p)², p(1−p), p², p³)
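The claimed invariant distribution can be verified by applying the chain once; the sketch below stores the reflecting-barrier matrix column-stochastically (entry [β][α] = P(next = β | current = α)):

```python
def stationary_check(p):
    """Verify that pi proportional to ((1-p)^3, (1-p)^2, p(1-p), p^2, p^3)
    is invariant for the 5-state reflecting-barrier walk."""
    q = 1 - p
    # Column-stochastic transition matrix: P[b][a] = P(next = b | current = a).
    P = [[0, q, 0, 0, 0],
         [1, 0, q, 0, 0],
         [0, p, 0, q, 0],
         [0, 0, p, 0, 1],
         [0, 0, 0, p, 0]]
    pi = [q**3, q**2, p * q, p**2, p**3]
    Z = sum(pi)
    pi = [v / Z for v in pi]
    # One step of the chain: (P pi)_b = sum_a P[b][a] pi[a].
    new = [sum(P[b][a] * pi[a] for a in range(5)) for b in range(5)]
    assert all(abs(new[b] - pi[b]) < 1e-12 for b in range(5))
    assert abs(Z - (1 + (1 - 2 * p) ** 2)) < 1e-12  # normaliser from the slide
    return pi

print(stationary_check(0.3))
```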
Markov Chains
• Ergodicity: irrespective of P(x1), the distribution P(xn) converges as n → ∞ to the required invariant distribution P*(x)
• A homogeneous Markov chain is ergodic under weak restrictions on the invariant distribution and the transition probabilities
• The invariant distribution is then called the equilibrium distribution
• An ergodic Markov chain has only one equilibrium distribution
• Higher-order Markov chains: P(xn+1 | x1, ..., xn) = P(xn+1 | xn, ..., xn−k)
• Generally, not necessarily the nearest k dependencies
First-order Markov Chains
• m-step transition probability of going from state α to state β in m steps: pαβ(m) = P (xn+m = β|xn = α)
• Chapman–Kolmogorov equations: pαβ(m + n) = Σγ pαγ(m) pγβ(n), since

  Σγ P(xk+m = γ | xk = α) P(xk+m+n = β | xk+m = γ)
    = Σγ P(xk+m+n = β, xk+m = γ | xk = α)
    ≡ P(xk+m+n = β | xk = α)
• In the matrix form: P(1) = P by definition; P(n) = Pn; P(m + n) = P(m)P(n)
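The matrix identity P(m + n) = P(m)P(n) can be checked numerically; the 2-state transition matrix below is a made-up example:

```python
def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(P, m):
    """m-step transition matrix P(m) = P^m."""
    n = len(P)
    R = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(m):
        R = matmul(R, P)
    return R

# A hypothetical 2-state row-stochastic transition matrix.
P = [[0.9, 0.1],
     [0.4, 0.6]]
lhs = matpow(P, 5)                        # P(2 + 3)
rhs = matmul(matpow(P, 2), matpow(P, 3))  # P(2) P(3)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
```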
Simulation of a Homogeneous Markov Chain
• Initial data: the marginal p.d. P0(x) and transition matrix P
• Sample x0 = a from the initial marginal distribution P0(x), i.e. randomly take one of the values according to the marginals P0(x)

Example: x ∈ {0, 1, 2}; P0(0) = 0.2; P0(1) = 0.3; P0(2) = 0.5

  r = random()  // a pseudo-random number in [0, 1]
  if r ≤ 0.2 then x = 0
  else if r ≤ 0.5 then x = 1
  else x = 2

• For n = 0, ..., N − 1, sample the next xn+1 = b for the current xn = a from the conditional p.d. Pn+1(x) = P(xn+1 = x | xn = a)

Example: xn = 2; Pn+1(0) = 0.1; Pn+1(1) = 0.2; Pn+1(2) = 0.7

  r = random()  // a pseudo-random number in [0, 1]
  if r ≤ 0.1 then x = 0
  else if r ≤ 0.3 then x = 1
  else x = 2
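Both sampling steps use the same threshold trick; the sketch below reuses the slide's initial marginals (0.2, 0.3, 0.5) and its conditional row (0.1, 0.2, 0.7) for state 2, while the transition rows for states 0 and 1 are assumed values:

```python
import random

def sample_categorical(probs, rng):
    """Threshold sampling: return index i with probability probs[i]."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1  # guard against floating-point rounding

def simulate_chain(p0, P, n_steps, seed=None):
    """Simulate a homogeneous Markov chain with initial marginals p0 and
    row-stochastic transition matrix P (row = current state)."""
    rng = random.Random(seed)
    x = sample_categorical(p0, rng)
    states = [x]
    for _ in range(n_steps):
        x = sample_categorical(P[x], rng)
        states.append(x)
    return states

P = [[0.5, 0.3, 0.2],   # assumed row for state 0
     [0.2, 0.5, 0.3],   # assumed row for state 1
     [0.1, 0.2, 0.7]]   # the slide's conditionals for state 2
states = simulate_chain([0.2, 0.3, 0.5], P, n_steps=10000, seed=7)
```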
Conditional Independence
[Figure: digraph on nodes X, Y, Z]

Directed graph (digraph) G = {V, E} of causal relationships:
• Sets of nodes V and edges E (ordered pairs of nodes)
• Directed path between X and Y: a sequence of arrows all pointing in the same direction
• Undirected path between X and Y: a sequence of adjacent nodes (ignoring the direction of arrows)
• X is an ancestor of Y and Y is a descendant of X if there is a directed path from X to Y (or X = Y)
• Cycle: a directed path starting and ending at the same node
• DAG (directed acyclic graph) – a digraph with no cycles
• Conditional independence of X and Y given Z: fX,Y|Z(x, y | z) = fX|Z(x | z) fY|Z(y | z) for all x, y, and z
• Equivalently: f(x | y, z) = f(x | z) – that is, given Z, Y provides no extra information about X
Probability and DAGs
Bayesian network – a digraph endowed with a p.d.
• Each node corresponds to a random variable
• Edges represent statistical dependencies
• A somewhat misleading term, as not only Bayesian inference is involved!
• (X, Y) ∈ E means an arrow points from X to Y: X and Y are called adjacent; X is a parent of Y; and Y is a child of X

Given a DAG G with vertices V = (X1, ..., Xk), a distribution P for V with probability function f is Markov to the graph G if

  f(v) ≡ f(x1, ..., xk) = ∏_{i=1}^{k} f(xi | πi)

where πi are the parents of Xi
• Equivalent notion: G represents P
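A minimal sketch of evaluating such a factorised joint, for a hypothetical chain DAG X1 → X2 → X3 with made-up binary probability tables:

```python
f1 = {0: 0.6, 1: 0.4}                   # f(x1)
f2 = {(0, 0): 0.7, (0, 1): 0.3,         # f(x2 | x1): keys are (x1, x2)
      (1, 0): 0.2, (1, 1): 0.8}
f3 = {(0, 0): 0.9, (0, 1): 0.1,         # f(x3 | x2): keys are (x2, x3)
      (1, 0): 0.5, (1, 1): 0.5}

def joint(x1, x2, x3):
    """Joint probability as the product of each node's conditional
    given its parents: f(x1) f(x2|x1) f(x3|x2)."""
    return f1[x1] * f2[(x1, x2)] * f3[(x2, x3)]

# A valid factorisation sums to 1 over all configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12
```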
Probability and DAGs
Markov Condition A distribution P is represented by a DAG G if and only if every variable W is conditionally independent of all the other variables except the parents and descendants of W , given the parents of W
Markov equivalent graphs: imply the same independence relations
• Estimate a distribution f consistent with a DAG G, given G and data sets V1, ..., Vn from f
• Likelihood: L(θ) = ∏_{i=1}^{n} f(Vi; θ) = ∏_{i=1}^{n} ∏_{j=1}^{m} f(Xij | πj; θj)
  • Xij – the value of the node Xj in the i-th data set Vi
  • θj – the parameters of the j-th conditional density
• Estimate G, given data V1, ..., Vn (a very challenging problem!)
  • Fit every possible DAG using MLE and use the Akaike information criterion or other methods to choose a DAG
Bayesian Network: An Example
[Figure: DAG on nodes x1, ..., x7 with edges x1→x4, x2→x4, x3→x4, x1→x5, x3→x5, x4→x6, x4→x7, x5→x7]
f(x1, ..., x7) = ∏_{j=1}^{7} f(xj | πj)
              = fa(x1) fb(x2) fc(x3) fd(x4 | x1, x2, x3) fe(x5 | x1, x3) fg(x6 | x4) fh(x7 | x4, x5)

• The name means only that statistical inference for digraphs can also be performed by Bayesian methods
• This model is also called a belief network
Bayesian Network: d-Separation
d-separation (directed separation) of the sets A and B of nodes: • By a set S of nodes such that the variables in A and B are independent given the variables in S, e.g.
A = {x1, x2, x3}; B = {x6, x7}; S = {x4, x5}
[Figure: the example DAG, shown twice, with S = {x4, x5} separating A = {x1, x2, x3} from B = {x6, x7}]
Bayesian Network: Markov Blanket
Markov blanket of a node j:
• The set of its parents, children, and co-parents (the other parents of its children)
• It d-separates j from all other nodes, e.g.
MB1 = {x2, x3, x4, x5}; MB2 = {x1, x3, x4}; MB5 = {x1, x3, x4, x7}
[Figure: the example DAG, shown three times, with the Markov blankets of x1, x2, and x5 highlighted]
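Computing a Markov blanket from an edge list is a one-liner per component; the sketch below reads the example DAG's edges off its factorisation:

```python
def markov_blanket(node, edges):
    """Markov blanket of a node in a DAG: its parents, children, and
    co-parents (other parents of its children)."""
    parents = {u for (u, v) in edges if v == node}
    children = {v for (u, v) in edges if u == node}
    coparents = {u for (u, v) in edges if v in children and u != node}
    return parents | children | coparents

# Edges of the example DAG, read off the factorisation
# fd(x4|x1,x2,x3) fe(x5|x1,x3) fg(x6|x4) fh(x7|x4,x5).
edges = [("x1", "x4"), ("x2", "x4"), ("x3", "x4"),
         ("x1", "x5"), ("x3", "x5"),
         ("x4", "x6"), ("x4", "x7"), ("x5", "x7")]

print(sorted(markov_blanket("x1", edges)))  # ['x2', 'x3', 'x4', 'x5']
print(sorted(markov_blanket("x2", edges)))  # ['x1', 'x3', 'x4']
print(sorted(markov_blanket("x5", edges)))  # ['x1', 'x3', 'x4', 'x7']
```

The three results match MB1, MB2, and MB5 quoted above.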
Undirected Graphs
[Figure: undirected graph on nodes X, Y, Z]

Undirected graph G = (V, E): a finite set V of nodes (vertices) and a finite set E of edges (arcs) consisting of pairs of vertices
• X and Y are adjacent (neighbours) if (X, Y) ∈ E or (Y, X) ∈ E
• X0, ..., Xn is a path if Xi and Xi+1 are adjacent for each i
• Complete graph: there is an edge between every pair of nodes
• Subgraph: a subset U ⊂ V of nodes together with their edges
• If A, B, and C are three distinct subsets of V, then C separates A and B if every path from a node in A to a node in B intersects a node in C
Markov Graphs
Pairwise Markov graph:
• V – a set of random variables with distribution P
• One node for each random variable in V
• No edge between a pair of variables that are independent given the rest of the variables
Theorem (Global Markov condition) Let G = (V,E) be a pairwise Markov graph for a distribution P . Let A, B, and C be distinct subsets of V such that C separates A and B. Then A and B are conditionally independent given C
• Unconnected A and B (i.e. no path between, or separated by the empty set) are independent • Pairwise and global Markov properties are equivalent
Markov Random Fields
MRF – also: Markov network, or undirected graphical model
• Each node corresponds to a random variable or a group of variables
• Ising model: edges for the nearest 4-neighbourhood

[Figure: http://patspam.com/applets/physics/ising%20model/fig1.jpg]

• Clique – a subset of nodes that are all adjacent to each other
• Maximal clique – a non-extendable clique
• Potential – any positive function
Markov Random Fields
Theorem (Hammersley-Clifford)
The joint distribution of the MRF variables factorises over the maximal cliques C of the graph:

  P(x1, ..., xk) = (1/Z) ∏_C ψC(xC),  where xC ≡ {xi : i ∈ C}

• Z – a normalisation constant (the partition function):

  Z = Σ_x ∏_C ψC(xC)

• ψC(xC) – a strictly positive potential function:

  ψC(xC) = exp(−E(xC))

• E(xC) – an energy function (total energy: E(x) = Σ_C E(xC))
Markov Random Fields
• Clique – a complete subgraph (each node pair is linked by an edge)
  • Clique of order 1 – a node itself
  • Clique of order 2 – each linked pair of nodes
  • Clique of order 3 – each linked triple of nodes
  • Clique of order 4 – each linked quadruple of nodes
• 2D Ising MRF:
  • Nodes – points of an arithmetic lattice R = ((i, j): i = 0, ..., I − 1; j = 0, ..., J − 1)
  • Signals xi,j with only two values {−1, 1}
  • Cliques of order 2 linking the nearest 4-neighbours: ((i, j), (i + 1, j)); ((i, j), (i, j + 1)) ∈ R²
  • Energy function E(x) = −γ Σ_{i,j} xi,j (xi+1,j + xi,j+1)
  • The coefficient γ is known or learned from a given training sample
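The energy function above, with free boundaries (missing neighbours simply contribute no clique), can be sketched as:

```python
def ising_energy(x, gamma):
    """E(x) = -gamma * sum_{i,j} x[i][j] * (x[i+1][j] + x[i][j+1])
    over the pairwise cliques of a 2D Ising model (free boundaries)."""
    I, J = len(x), len(x[0])
    e = 0.0
    for i in range(I):
        for j in range(J):
            if i + 1 < I:
                e += x[i][j] * x[i + 1][j]
            if j + 1 < J:
                e += x[i][j] * x[i][j + 1]
    return -gamma * e

# All-equal spins minimise the energy; a checkerboard maximises it.
uniform = [[1, 1], [1, 1]]
checker = [[1, -1], [-1, 1]]
print(ising_energy(uniform, 1.0))  # -4.0 (4 agreeing pairs)
print(ising_energy(checker, 1.0))  # 4.0 (4 disagreeing pairs)
```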
2D Ising MRF – Pixel-wise Transitions
[Figure: the four nearest neighbours of pixel (i, j): (i − 1, j), (i + 1, j), (i, j − 1), (i, j + 1)]
Probability of a sample x = (xi,j : (i, j) ∈ R):

  P(x) = (1/Z) exp(−E(x))

Transition probability (x[i,j] denotes the sample x except the point xi,j):

  pi,j(xi,j) = P(x) / Σ_{xi,j ∈ {−1,1}} P(x) ≡ exp(γ xi,j Si,j) / (exp(−γ Si,j) + exp(γ Si,j))

where Si,j = xi−1,j + xi+1,j + xi,j−1 + xi,j+1
Simulating Samples of a 2D Ising MRF
Two conditionally independent sub-lattices:
R0 = {(i, j):(i, j) ∈ R;(i + j) mod 2 = 0} R1 = {(i, j):(i, j) ∈ R;(i + j) mod 2 = 1}
Markov Chain Monte Carlo Simulation
Pseudocode:

  for a ∈ {0, 1}
    for (i, j) ∈ Ra
      S = γ (xi−1,j + xi+1,j + xi,j−1 + xi,j+1)   // xα,β = 0 if (α, β) ∉ R
      p−1 = exp(−S) / (exp(−S) + exp(S));  p1 = 1 − p−1
      if r = random() < p−1 then xi,j = −1; else xi,j = 1
    end for
  end for

Example: all four neighbours of xi,j equal −1, so Si,j = −4:
  For γ = 0.1: S = −0.4; p−1 = e^0.4 / (e^0.4 + e^−0.4) = 0.69; p1 = 0.31
  For γ = 1.0: S = −4.0; p−1 = e^4.0 / (e^4.0 + e^−4.0) = 0.9997; p1 = 0.0003
Example: three neighbours of xi,j equal −1 and one equals +1, so Si,j = −2:
  For γ = 0.1: S = −0.2; p−1 = e^0.2 / (e^0.2 + e^−0.2) = 0.60; p1 = 0.40
  For γ = 1.0: S = −2.0; p−1 = e^2.0 / (e^2.0 + e^−2.0) = 0.982; p1 = 0.018

Example: two neighbours of xi,j equal −1 and two equal +1, so Si,j = 0:
  For γ = 0.1 or γ = 1.0: S = 0; p−1 = p1 = 0.50
Generated samples {xt; t = 0, 1,...,T } are distributed according to the 2D Ising MRF P (x) when t → ∞
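A runnable sketch of the whole sampler, updating the two sub-lattices in turn with P(xi,j = −1) ∝ exp(−γ Si,j) as in the transition-probability formula; the lattice size, γ, and sweep count are arbitrary choices for illustration:

```python
import math
import random

def ising_sweep(x, gamma, rng):
    """One chequerboard Gibbs sweep of the 2D Ising sampler: update the
    even sub-lattice R0, then the odd one R1; off-grid neighbours count as 0."""
    I, J = len(x), len(x[0])
    def nbr_sum(i, j):
        s = 0
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < I and 0 <= j + dj < J:
                s += x[i + di][j + dj]
        return s
    for a in (0, 1):
        for i in range(I):
            for j in range(J):
                if (i + j) % 2 != a:
                    continue
                S = gamma * nbr_sum(i, j)
                p_minus = math.exp(-S) / (math.exp(-S) + math.exp(S))
                x[i][j] = -1 if rng.random() < p_minus else 1

rng = random.Random(0)
x = [[rng.choice((-1, 1)) for _ in range(32)] for _ in range(32)]
for _ in range(200):
    ising_sweep(x, gamma=1.0, rng=rng)
# At gamma = 1.0 the coupling is strong, so after many sweeps most
# neighbouring spins agree (large ordered patches).
```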
http://www.physics.cornell.edu/sethna/teaching/StatPhys/GIF/ScalingIsing.gif