Bayesian Networks
ALGORITHMS FOR DECISION SUPPORT
Bayesian networks
Guest Lecturer: Silja Renooij
(Thanks to Thor Whalen for kindly contributing to these slides)

Topics:
• Probabilistic independence
• Conditional (in)dependence
• Chain rule & independence
• Independence & space/time complexity
• Efficient representation of independence
• Bayesian network (BN)

Bayesian network queries
• arg max_h P(H = h | E = e) = arg max_h P(H = h, E = e)
• P(H = h | E = e) = P(H = h, E = e) / P(E = e) ∝ P(H = h, E = e)

Complexity of queries (decision versions): all NP-hard.

Inference algorithms
• Exact inference
  – Variable elimination (VE)
  – Message passing (Pearl)
  – Junction tree propagation (aka join tree / Hugin propagation)
• Approximate inference
  – Loopy belief propagation
  – Stochastic sampling (various Monte Carlo methods)
  – (!) In general, approximation (within a guaranteed margin of error) does not reduce the complexity of inference

Idea behind the junction tree algorithm
[Figure: a "clever algorithm" turns the loopy graph on nodes A, B, C, D into the chain A – BC – D by clustering B and C.]
Many problems that are hard on arbitrary graphs are easy on tree-like structures. Or, more specifically…

Cliques and separators
[Figure: an undirected graph on nodes A–H next to its junction tree; the cliques (or "bags") are ABD, ACE, ADE, CEG, DEF, EGH and the separators are AD, AE, CE, DE, EG.]

Bayesian network vs. secondary structure (junction tree):
• Bayesian network: one-dimensional stochastic variables, conditional probabilities
• Junction tree: multi-dimensional stochastic variables, cluster 'potentials'

Let's take a couple of steps back…

VE example in the "Asia" network
[Figure: the "Asia" network on nodes V, S, T, L, A, B, X, D.]
We are interested in P(D), so we need to sum out (eliminate) V, S, X, T, L, A, B.
Initial factors:
  P(V) P(S) P(T|V) P(L|S) P(B|S) P(A|T,L) P(X|A) P(D|A,B)
Brute force:
  P(D) = Σ_v Σ_s Σ_x Σ_t Σ_l Σ_a Σ_b P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(D|a,b)
But let's try something more elegant…

VE example continued
Eliminate variables in the order V → S → X → T → L → A → B. In each iteration, combine the factors that mention the variable being eliminated and sum it out:

Eliminate V:
  f_V(T) = Σ_v P(v) P(T|v)
  [Note: although f_V(T) = P(T), in general the result of elimination is not necessarily a probability term.]
  ⇒ f_V(T) P(S) P(L|S) P(B|S) P(A|T,L) P(X|A) P(D|A,B)
  f_V(T) more or less 'joins' T and V.

Eliminate S:
  f_S(B,L) = Σ_s P(s) P(B|s) P(L|s)
  [Note: the result of elimination may be a function of several variables; L and B thus become 'connected'.]
  ⇒ f_V(T) f_S(B,L) P(A|T,L) P(X|A) P(D|A,B)

Eliminate X:
  f_X(A) = Σ_x P(x|A)
  [Note: f_X(a) = 1 for all values a of A.]
  ⇒ f_V(T) f_S(B,L) f_X(A) P(A|T,L) P(D|A,B)

Eliminate T:
  f_T(A,L) = Σ_t f_V(t) P(A|t,L)
  [Note: factors f can include other f's; this factor 'joins' T and L.]
  ⇒ f_S(B,L) f_X(A) f_T(A,L) P(D|A,B)

Eliminate L:
  f_L(A,B) = Σ_l f_S(B,l) f_T(A,l)
  [Note: 'joins' A and B.]
  ⇒ f_L(A,B) f_X(A) P(D|A,B)

Eliminate A:
  f_A(B,D) = Σ_a f_L(a,B) f_X(a) P(D|a,B)
  ⇒ f_A(B,D)

Eliminate B:
  f_B(D) = Σ_b f_A(b,D)
  ⇒ f_B(D)
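Each elimination step above follows the same pattern: collect the factors that mention the variable, multiply them, and sum the variable out. As a purely illustrative sketch (not part of the lecture material), the Python below implements that pattern with a dictionary-based factor representation; the function names and the CPT numbers in the small usage example are invented, and all variables are assumed binary.

```python
from itertools import product

# A factor is a dict with the variables it mentions and a table mapping each
# joint assignment (a tuple of 0/1 values, one per variable) to a number.
def make_factor(variables, table):
    return {"vars": tuple(variables), "table": dict(table)}

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    vars_out = tuple(dict.fromkeys(f["vars"] + g["vars"]))
    table = {}
    for assignment in product((0, 1), repeat=len(vars_out)):
        env = dict(zip(vars_out, assignment))
        fv = f["table"][tuple(env[v] for v in f["vars"])]
        gv = g["table"][tuple(env[v] for v in g["vars"])]
        table[assignment] = fv * gv
    return {"vars": vars_out, "table": table}

def sum_out(var, f):
    """Marginalise one variable out of a factor."""
    idx = f["vars"].index(var)
    vars_out = f["vars"][:idx] + f["vars"][idx + 1:]
    table = {}
    for assignment, value in f["table"].items():
        key = assignment[:idx] + assignment[idx + 1:]
        table[key] = table.get(key, 0.0) + value
    return {"vars": vars_out, "table": table}

def eliminate(factors, order):
    """Variable elimination: for each variable in the ordering, combine the
    factors that mention it, sum it out, and put the result back in the pool."""
    factors = list(factors)
    for var in order:
        relevant = [f for f in factors if var in f["vars"]]
        if not relevant:
            continue
        factors = [f for f in factors if var not in f["vars"]]
        combined = relevant[0]
        for f in relevant[1:]:
            combined = multiply(combined, f)
        factors.append(sum_out(var, combined))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Usage on the first step of the slide example: eliminating V from P(V)P(T|V)
# yields f_V(T).  The numbers below are made up for illustration only.
P_V = make_factor(["V"], {(0,): 0.99, (1,): 0.01})
P_T_given_V = make_factor(["T", "V"], {(0, 0): 0.99, (1, 0): 0.01,
                                       (0, 1): 0.95, (1, 1): 0.05})
f_V = eliminate([P_V, P_T_given_V], ["V"])
print(f_V["vars"], f_V["table"])   # a factor over T alone
```

Running the full example would simply pass all eight initial factors and the ordering V, S, X, T, L, A, B to eliminate().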
VE intermediate factors
• With the ordering V → S → X → T → L → A → B (our previous example), the intermediate factors are:
  f_V(T), f_S(B,L), f_X(A), f_T(A,L), f_L(A,B), f_A(B,D), f_B(D)
• With a different ordering, A → B → X → T → V → S → L, they are:
  g_A(L,T,D,B,X), g_B(L,T,D,X,S), g_X(L,T,D,S), g_T(L,D,S,V), g_V(L,D,S), g_S(L,D), g_L(D)
Complexity is exponential in the size of these factors!

Notes about VE
• The actual computation is done in the elimination steps.
• The computation depends on the order of elimination.
• For each query we need to compute everything again!
  – Many redundant calculations.

Junction trees
• The redundant calculations of VE can be avoided by 'generalising' it to the junction tree (JT) algorithm (introduced by Lauritzen & Spiegelhalter, 1988).
• The JT algorithm compiles a class of elimination orders into a data structure that supports the computation of all possible queries.

Building a junction tree
DAG → Moral graph → Triangulated graph → Identifying cliques → Junction tree

Step 1: Moralization
[Figure: the example DAG G = (V, A) on nodes A–H and its moral graph G^M.]
1. For all Z ∈ V: for all X, Y ∈ par(Z), add an edge X–Y.
2. Undirect all edges.

Step 2: Triangulation
[Figure: the moral graph G^M and the triangulated graph G^T; a chordless 4-cycle is marked NO, a chorded one YES.]
Add edges to G^M such that there is no cycle of length ≥ 4 that does not contain a chord.

Step 3: Identifying cliques
[Figure: the triangulated graph G^T and its maximal cliques.]
Take all maximal cliques (complete subgraphs) of G^T: ABD, ACE, ADE, DEF, CEG, EGH.

Step 4-I: Junction graph
[Figure: the cliques of G^T and the junction graph G^J, with separators labelling the edges, e.g. ADE ∩ DEF = DE.]
• A junction graph for an undirected graph G is an undirected, labeled graph.
• The nodes are the cliques in G.
• If two cliques intersect, they are joined in the junction graph by an edge labeled with their intersection.

Step 4-II: Junction tree
A junction tree is a subgraph of the junction graph that
• is a tree,
• contains all the cliques (a spanning tree),
• satisfies the running intersection property: for each pair of nodes X, Y, all nodes on the path between X and Y contain X ∩ Y.
[Figure: the junction graph G^J and a junction tree G^JT with edges ABD–ADE (separator AD), ADE–ACE (AE), ACE–CEG (CE), ADE–DEF (DE) and CEG–EGH (EG).]

Running intersection?
All cliques Z and separators S along the path between any two nodes X and Y contain the intersection X ∩ Y.
Example: X = {A,B,D}, Y = {A,C,E} ⇒ X ∩ Y = {A}; on the path from X to Y we indeed have Z = {A,D,E} ⊇ {A}, S1 = {A,D} ⊇ {A}, S2 = {A,E} ⊇ {A}.
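Two of the construction steps translate almost literally into code: the moralization rule from step 1 and the running-intersection check from step 4-II. The following Python sketch is illustrative only; the tiny DAG and the two-clique tree in the usage part are hypothetical and are not the A–H example from the slides.

```python
from itertools import combinations

def moralize(dag):
    """Step 1: marry the parents of every node, then drop edge directions.
    `dag` maps each node to the set of its parents; the result is a set of
    undirected edges represented as frozensets."""
    edges = set()
    for child, parents in dag.items():
        for parent in parents:                      # undirected version of each arc
            edges.add(frozenset({parent, child}))
        for u, v in combinations(parents, 2):       # marry the parents
            edges.add(frozenset({u, v}))
    return edges

def path_in_tree(tree, start, goal):
    """The unique path between two nodes of a tree given as an adjacency dict."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in tree[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

def has_running_intersection(cliques, tree):
    """Step 4-II: every clique on the path between X and Y must contain X ∩ Y.
    `cliques` maps a clique name to its set of variables; `tree` is an
    adjacency dict over the clique names."""
    for x, y in combinations(cliques, 2):
        shared = cliques[x] & cliques[y]
        for z in path_in_tree(tree, x, y):
            if not shared <= cliques[z]:
                return False
    return True

# Hypothetical mini example (not the graph from the slides):
dag = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(moralize(dag))    # contains the 'moral' edge {B, C}, since both are parents of D

cliques = {"ABC": {"A", "B", "C"}, "BCD": {"B", "C", "D"}}
tree = {"ABC": ["BCD"], "BCD": ["ABC"]}
print(has_running_intersection(cliques, tree))   # True
```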
Using a junction tree for inference
DAG → Junction tree → Initialization → Inconsistent junction tree → Propagation (message passing) → Consistent junction tree → Marginalization (summing out) → P(V = v | E = e)

Step 1: Initialization
• For each (conditional) distribution P(X | par(X)) from the BN, create a node potential φ_X = P(X | par(X)).
• Assign each node potential to a single clique C for which {X} ∪ par(X) ⊆ C.
• The clique potential φ_C of C is the product of its assigned node potentials.

Marginalisation and inconsistency
• Before propagation, potentials in the junction tree can be inconsistent, i.e. computing a marginal P(X_i) from different cliques can give different results:
  P(A) = Σ_{c,e} φ_ACE = (0.12, 0.33, 0.11, 0.03)
  P(A) = Σ_{d,e} φ_ADE = (0.02, 0.43, 0.31, 0.12)

Propagating potentials: idea
Message passing from clique A to clique B:
1. Project the potential of A onto the separator S_AB (projection): φ_S^new = Σ_{A \ S} φ_A, where S = S_AB.
2. Absorb the potential of the separator S_AB into B (absorption): φ_B^new = φ_B · φ_S^new / φ_S^old.

Global propagation: idea
1. Choose a root.
2. COLLECT-EVIDENCE: messages 1–5 are passed from the leaves to the root. (NB: this corresponds with a perfect elimination order!)
3. DISTRIBUTE-EVIDENCE: messages 6–10 are passed from the root back to the leaves.
[Figure: the example junction tree with the ten messages numbered along its edges.]
After global propagation, the potentials are consistent and marginalisation gives correct results.
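To make the bookkeeping of a single message concrete, the sketch below implements projection and absorption for potentials stored as tables. It is an illustration under simplifying assumptions (binary variables, invented numbers, no evidence entry, no COLLECT/DISTRIBUTE scheduling), not the lecture's implementation.

```python
from itertools import product

def potential(variables, table):
    return {"vars": tuple(variables), "table": dict(table)}

def project(phi, sep_vars):
    """Projection: sum the clique potential down onto the separator variables."""
    idxs = [phi["vars"].index(v) for v in sep_vars]
    table = {}
    for assignment, value in phi["table"].items():
        key = tuple(assignment[i] for i in idxs)
        table[key] = table.get(key, 0.0) + value
    return {"vars": tuple(sep_vars), "table": table}

def absorb(phi_B, sep_old, sep_new):
    """Absorption: multiply phi_B pointwise by phi_S^new / phi_S^old."""
    idxs = [phi_B["vars"].index(v) for v in sep_new["vars"]]
    table = {}
    for assignment, value in phi_B["table"].items():
        key = tuple(assignment[i] for i in idxs)
        # 0/0 is treated as 0, the usual convention in junction tree propagation.
        ratio = sep_new["table"][key] / sep_old["table"][key] if sep_old["table"][key] else 0.0
        table[assignment] = value * ratio
    return {"vars": phi_B["vars"], "table": table}

def pass_message(phi_A, phi_B, sep):
    """One message from clique A to clique B through their separator."""
    sep_new = project(phi_A, sep["vars"])
    return absorb(phi_B, sep, sep_new), sep_new

# Toy example with two cliques {X, Y} and {Y, Z} sharing separator {Y}.
# All numbers are invented; variables are binary with values 0/1.
phi_XY = potential(["X", "Y"], {(x, y): 0.25 for x, y in product((0, 1), repeat=2)})
phi_YZ = potential(["Y", "Z"], {(y, z): 0.25 for y, z in product((0, 1), repeat=2)})
phi_Y = potential(["Y"], {(0,): 1.0, (1,): 1.0})          # separator initialised to 1

phi_YZ, phi_Y = pass_message(phi_XY, phi_YZ, phi_Y)
print(phi_Y["table"])     # projection of phi_XY onto Y
print(phi_YZ["table"])    # phi_YZ after absorbing the message
```

A full COLLECT-EVIDENCE / DISTRIBUTE-EVIDENCE pass would simply call pass_message along every edge of the junction tree, first towards the chosen root and then back out to the leaves.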